Active Learning for Direct Preference Optimization
By: Branislav Kveton, Xintong Li, Julian McAuley, and more
Potential Business Impact:
Teaches AI to learn faster from human choices.
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) in which the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of selecting the most informative feedback for training them is under-explored. We propose an active learning framework for DPO that can be applied to collect human feedback online or to choose the most informative subset of already collected feedback offline, and we develop efficient algorithms for both settings. The key idea is to linearize the DPO objective at the last layer of the neural-network representation of the optimized policy and then compute a D-optimal design to collect preferential feedback. We prove that the errors in our DPO logit estimates diminish with more feedback, and we show the effectiveness of our algorithms empirically, both in a setting that matches our theory and on large language models.
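To make the key idea concrete, the sketch below illustrates a generic greedy D-optimal design over last-layer features, which is one standard way to select the most informative preference pairs. It assumes each candidate pair is summarized by the difference of its last-layer features; the function name `greedy_d_optimal`, the regularization constant, and the random features are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def greedy_d_optimal(features: np.ndarray, budget: int, reg: float = 1e-3) -> list[int]:
    """Greedily pick `budget` rows of `features` that maximize the log-determinant
    of the regularized information matrix A = reg * I + sum_i x_i x_i^T.

    `features` is an (n, d) array; row i stands for the last-layer feature
    difference phi(prompt_i, chosen_i) - phi(prompt_i, rejected_i).
    """
    n, d = features.shape
    A_inv = np.eye(d) / reg          # inverse of the current information matrix
    selected: list[int] = []
    remaining = set(range(n))

    for _ in range(budget):
        # Adding x x^T to A increases log det A by log(1 + x^T A^{-1} x),
        # so pick the candidate with the largest such gain.
        best_i, best_gain = -1, -np.inf
        for i in remaining:
            x = features[i]
            gain = float(x @ A_inv @ x)
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        remaining.remove(best_i)

        # Sherman-Morrison update of A^{-1} after adding x x^T.
        x = features[best_i]
        Ax = A_inv @ x
        A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)

    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical last-layer feature differences for 500 candidate preference pairs.
    diffs = rng.normal(size=(500, 16))
    picked = greedy_d_optimal(diffs, budget=32)
    print("Most informative pairs (first 10 of 32):", picked[:10])
```

In the offline setting this kind of selection would rank an already collected pool of preference pairs; in the online setting the same gain criterion could score candidate prompts before querying human annotators. The paper's own algorithms and guarantees may differ in the details.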
Similar Papers
A Survey of Direct Preference Optimization
Machine Learning (CS)
Teaches computers to be helpful and safe.
Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
Computation and Language
Makes AI learn better from what people like.
Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences
Artificial Intelligence
Teaches AI to understand many different opinions.