More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
By: Yifan Wang, Runjin Chen, Bolian Li, and more
Potential Business Impact:
Makes AI safer by showing which training-data choices to avoid.
Aligning large language models (LLMs) with human values is an increasingly critical step in post-training. Direct Preference Optimization (DPO) has emerged as a simple yet effective alternative to reinforcement learning from human feedback (RLHF). Synthetic preference data, with its low cost and high quality, enables effective alignment, whether the chosen and rejected responses are generated by a single model or by multiple models. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: although multi-model generated data enhances performance on general tasks (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande) by providing diverse responses, it also tends to facilitate reward hacking during training. This can lead to a high attack success rate (ASR) when models encounter jailbreaking prompts. The issue is particularly pronounced when stronger models such as GPT-4o, or larger models in the same family, are used to generate chosen responses that are paired with rejected responses self-generated by the target model, resulting in dramatically poorer safety outcomes. Furthermore, with respect to safety, using solely self-generated responses (single-model generation) for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models, whether used directly as chosen data or as part of a multi-model response pool. We demonstrate that multi-model preference data exhibits high linear separability between chosen and rejected responses, which allows models to exploit superficial cues rather than internalizing robust safety constraints. Our experiments, conducted on models from the Llama, Mistral, and Qwen families, consistently validate these findings.
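The abstract turns on two mechanics: DPO training on (chosen, rejected) preference pairs, and a linear-separability measurement over those pairs. The sketch below is an illustrative reconstruction, not the authors' code: `dpo_loss` implements the standard DPO objective on sequence log-probabilities, and `linear_separability` is one plausible way to quantify the reported separability via a cross-validated linear probe on response embeddings. All function names, data, and the choice of embedding are placeholders and assumptions.

```python
# Minimal sketch (assumed, not the authors' implementation) of:
#  (1) the standard DPO loss over chosen/rejected preference pairs, and
#  (2) a linear probe estimating how linearly separable chosen vs. rejected
#      responses are in some embedding space.
# All inputs below are random placeholders standing in for real model outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (policy_chosen_logp - policy_rejected_logp) \
             - (ref_chosen_logp - ref_rejected_logp)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))


def linear_separability(chosen_emb, rejected_emb, folds=5):
    """Cross-validated accuracy of a linear classifier distinguishing chosen
    from rejected embeddings; accuracy near 1.0 means the pairs can be told
    apart from superficial, linearly decodable cues."""
    X = np.vstack([chosen_emb, rejected_emb])
    y = np.concatenate([np.ones(len(chosen_emb)), np.zeros(len(rejected_emb))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=folds).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 512, 256

    # Placeholder sequence log-probabilities for a batch of preference pairs.
    loss = dpo_loss(rng.normal(-40, 5, n), rng.normal(-45, 5, n),
                    rng.normal(-42, 5, n), rng.normal(-44, 5, n))
    print(f"mean DPO loss: {loss.mean():.3f}")

    # Placeholder embeddings: chosen responses from a stronger model vs.
    # self-generated rejected responses, drawn from shifted distributions.
    chosen = rng.normal(0.5, 1.0, (n, d))
    rejected = rng.normal(-0.5, 1.0, (n, d))
    print(f"linear probe accuracy: {linear_separability(chosen, rejected):.3f}")
```

Read against the paper's claim, a high probe accuracy on multi-model pairs would indicate that chosen and rejected responses differ along easily detectable surface features, which the DPO objective can exploit (reward hacking) instead of learning safety-relevant distinctions.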
Similar Papers
Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals
Machine Learning (CS)
Teaches AI to follow many different rules better.
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
Computation and Language
Makes AI safer and cheaper to train.
Primal-Dual Direct Preference Optimization for Constrained LLM Alignment
Machine Learning (CS)
Makes AI safer and cheaper to train.