Coverage Improvement and Fast Convergence of On-policy Preference Learning
By: Juno Kim, Jihun Yun, Jason D. Lee, and more
Online on-policy preference learning algorithms for language model alignment such as online direct preference optimization (DPO) can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy's coverage evolves throughout on-policy training. We propose and rigorously justify the \emph{coverage improvement principle}: with sufficient batch size, each update moves into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and a linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations for batch size exceeding a generalized coverage threshold. In contrast, any learner restricted to offline samples from the initial policy suffers a slower minimax rate, leading to a sharp separation in total sample complexity. Motivated by this analysis, we further propose a simple hybrid sampler based on a novel \emph{preferential} G-optimal design, which removes dependence on coverage and guarantees convergence in just two rounds. Finally, we develop principled on-policy schemes for reward distillation in the general function class setting, and show faster noiseless rates under an alternative deviation-based notion of coverage. Experimentally, we confirm that on-policy DPO and our proposed reward distillation algorithms outperform their off-policy counterparts and enjoy stable, monotonic performance gains across iterations.
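To make the setting concrete, below is a minimal NumPy sketch of iterative on-policy DPO in a contextual bandit with Bradley-Terry preferences and a linear softmax policy class, in the spirit of the setup described in the abstract. It is not the paper's algorithm or code: the feature map `phi`, the latent linear reward `w_star`, and all hyperparameters (`beta`, `lr`, `n_rounds`, `n_inner`, `batch_size`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes and hyperparameters, chosen only for illustration.
d, n_actions, n_contexts = 8, 16, 32
beta, lr, n_rounds, n_inner, batch_size = 0.1, 0.5, 10, 50, 1024

# Random features phi(x, a) and a latent linear reward r*(x, a) = <w*, phi(x, a)>.
phi = rng.normal(size=(n_contexts, n_actions, d))
w_star = rng.normal(size=d)
reward = phi @ w_star                                  # shape (n_contexts, n_actions)

def policy(theta):
    """Linear softmax policy: pi_theta(a | x) proportional to exp(<theta, phi(x, a)>)."""
    logits = phi @ theta
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

theta = np.zeros(d)
for t in range(n_rounds):
    pi_ref = policy(theta)   # the current policy serves as both sampler and DPO reference

    # On-policy data: uniform contexts, two candidate actions drawn from pi_ref per context.
    xs = rng.integers(n_contexts, size=batch_size)
    a1 = np.array([rng.choice(n_actions, p=pi_ref[x]) for x in xs])
    a2 = np.array([rng.choice(n_actions, p=pi_ref[x]) for x in xs])

    # Bradley-Terry labels: P(a1 preferred over a2) = sigmoid(r*(x, a1) - r*(x, a2)).
    p_win = 1.0 / (1.0 + np.exp(-(reward[xs, a1] - reward[xs, a2])))
    first_wins = rng.random(batch_size) < p_win
    yw = np.where(first_wins, a1, a2)                  # preferred ("winning") action
    yl = np.where(first_wins, a2, a1)                  # dispreferred ("losing") action

    # Fit the DPO objective -log sigmoid(beta * (log pi/pi_ref at yw - at yl)) on this batch.
    for _ in range(n_inner):
        pi = policy(theta)
        margin = beta * (np.log(pi[xs, yw] / pi_ref[xs, yw])
                         - np.log(pi[xs, yl] / pi_ref[xs, yl]))
        coeff = beta / (1.0 + np.exp(margin))          # = beta * sigmoid(-margin)
        # For a linear softmax, grad_theta log pi(a|x) = phi(x, a) - E_{a'~pi}[phi(x, a')];
        # the expectation term cancels in the winner-minus-loser difference.
        grad = (coeff[:, None] * (phi[xs, yw] - phi[xs, yl])).mean(axis=0)
        theta += lr * grad

    avg_reward = (policy(theta) * reward).sum(axis=1).mean()
    print(f"round {t}: expected reward under pi_theta = {avg_reward:.3f}")
```

The offline baseline the abstract contrasts with would instead draw all preference pairs once from the initial policy and never refresh the sampler; the point of the coverage improvement principle is that resampling from the updated policy each round makes later batches progressively more informative.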