Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs
By: Skyler Wu, Aymen Echarghaoui
Potential Business Impact:
Helps AI language models learn human preferences better by intelligently weighting several reference models.
Fine-tuning is integral for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models, so as to leverage their collective desirable properties. However, current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and one online method that treats reference weighting as a $K$-armed bandit solved via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and Phi families (0.5B-14B parameters) show that all four of our strategies outperform existing MRPO weighting methods in preference accuracy on UltraFeedback and SafeRLHF. More strikingly, however, we find that single-reference DPO, using any one of six of the seven references, consistently outperforms all tested multiple-reference approaches, which calls into question the practical appeal of multiple-reference methods.
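To make the setup concrete, the sketch below shows one plausible implementation of a multi-reference DPO loss together with a Thompson Sampling weight selector. It assumes the mixture reference is a weighted average of the reference models' log-probabilities and that each reference is treated as a Bernoulli bandit arm rewarded when preference pairs are ranked correctly; the names (`mrpo_loss`, `ThompsonReferenceSelector`) and the hyperparameter `beta` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F


def mrpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              ref_weights, beta=0.1):
    """DPO-style loss regularized toward a weighted mixture of references.

    policy_*_logps: (batch,) log-probs of the chosen/rejected responses
        under the policy being fine-tuned.
    ref_*_logps: (K, batch) log-probs under each of the K frozen references.
    ref_weights: (K,) nonnegative weights summing to 1.
    """
    # Collapse the K references into one "mixture" reference by taking a
    # weighted average of their log-probabilities (an assumed aggregation).
    mix_chosen = torch.einsum("k,kb->b", ref_weights, ref_chosen_logps)
    mix_rejected = torch.einsum("k,kb->b", ref_weights, ref_rejected_logps)

    # Standard DPO margin: difference of policy-vs-reference log-ratios
    # between the chosen and rejected responses, scaled by beta.
    logits = beta * ((policy_chosen_logps - mix_chosen)
                     - (policy_rejected_logps - mix_rejected))
    return -F.logsigmoid(logits).mean()


class ThompsonReferenceSelector:
    """K-armed Bernoulli bandit over reference models (illustrative)."""

    def __init__(self, num_refs):
        # Beta(1, 1) prior on each reference's "usefulness".
        self.alpha = np.ones(num_refs)
        self.beta = np.ones(num_refs)

    def sample_weights(self):
        # Draw one success probability per arm, then normalize the draws
        # into a weight vector over the references.
        draws = np.random.beta(self.alpha, self.beta)
        return torch.tensor(draws / draws.sum(), dtype=torch.float32)

    def update(self, ref_idx, reward):
        # reward in {0, 1}, e.g. whether the batch's preference pairs were
        # ranked correctly while weighting reference ref_idx most heavily.
        self.alpha[ref_idx] += reward
        self.beta[ref_idx] += 1 - reward


if __name__ == "__main__":
    K, B = 7, 4  # seven references, toy batch of four preference pairs
    selector = ThompsonReferenceSelector(K)
    w = selector.sample_weights()
    loss = mrpo_loss(torch.randn(B), torch.randn(B),
                     torch.randn(K, B), torch.randn(K, B), w)
    print(float(loss))
    selector.update(ref_idx=int(torch.argmax(w)), reward=1)
```

The normalized Beta draws stand in for the online bandit strategy described above; the two offline strategies would instead fix `ref_weights` from held-out validation signal before training.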
Similar Papers
Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model
Computation and Language
Makes AI learn better from what people like.
Lightweight Robust Direct Preference Optimization
Machine Learning (CS)
Makes AI learn better from messy human feedback.
Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals
Machine Learning (CS)
Teaches AI to follow many different rules better.