Description

This paper provides a theoretical and empirical analysis of **on-policy preference learning**, a method used to align large language models with human values. The authors introduce the **coverage improvement principle**, showing that updating a model on its own generated data, rather than on static offline datasets, creates a feedback loop in which each new round of data becomes increasingly informative. This process allows **on-policy Direct Preference Optimization (DPO)** to achieve **exponentially faster convergence** and lower sample complexity than traditional offline approaches. To accelerate alignment further, the researchers propose a **hybrid sampler** based on a novel **preferential G-optimal design** that guarantees convergence in only two training rounds. They also develop **reward distillation schemes** that exploit relative reward signals to achieve even faster learning rates than standard preference-based methods. Experimental results on **summarization and chat tasks** confirm that these on-policy techniques yield stable, monotonic performance gains while avoiding the degradation often observed with offline training.
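
To make the on-policy loop described above concrete, here is a minimal sketch of one training round, not the authors' implementation: `policy`, `ref_policy`, and `label_preference` are hypothetical interfaces standing in for a language model with sampling and log-probability methods and a preference oracle (e.g., human raters or a reward model).

```python
# Minimal sketch of one on-policy DPO round (illustrative, not the paper's code).
# Assumes `policy`/`ref_policy` expose .sample(prompt) and .log_prob(prompt, response),
# and `label_preference(prompt, a, b)` returns the (winner, loser) pair.
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on a single (winner, loser) response pair."""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(margin)

def on_policy_dpo_round(policy, ref_policy, prompts, label_preference, optimizer, beta=0.1):
    """Sample fresh pairs from the *current* policy, label them, and update.

    Repeating this round is what drives the coverage-improvement feedback loop:
    later rounds sample from a better-aligned policy, so the data they produce
    covers high-reward responses more densely.
    """
    for prompt in prompts:
        y_a, y_b = policy.sample(prompt), policy.sample(prompt)
        winner, loser = label_preference(prompt, y_a, y_b)

        loss = dpo_loss(
            policy.log_prob(prompt, winner),
            policy.log_prob(prompt, loser),
            ref_policy.log_prob(prompt, winner),
            ref_policy.log_prob(prompt, loser),
            beta=beta,
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```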
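
Reward distillation, in turn, replaces the binary preference label with the observed reward gap. The function below shows one natural squared-error form of such an objective; the paper's exact scheme may differ, and the argument names are again illustrative.

```python
def reward_distillation_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                             reward_w, reward_l, beta=0.1):
    """Regress the policy's implicit reward margin onto the observed reward gap.

    The relative reward signal (reward_w - reward_l) carries more information
    than a 0/1 preference label, which is the intuition behind the faster
    learning rates reported for reward distillation.
    """
    implicit_gap = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    target_gap = reward_w - reward_l
    return (implicit_gap - target_gap) ** 2
```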