
Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation

Published: October 15, 2025 | arXiv ID: 2510.13434v1

By: Hao Wang, Linlong Xu, Heng Liu and more

Potential Business Impact:

Teaches computers to translate languages better.

Business Areas:

A/B Testing, Data and Analytics

Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) with human preferences in Machine Translation (MT), but current methods face two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors such as translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M^2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two viewpoints: a new hallucination penalty for factuality, and a dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment. This engine is paired with a multi-pair construction strategy that builds a comprehensive set of preference pairs from the entire pool of translation candidates. Together, the two components ensure the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.
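
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so the linear blend in `fused_score`, the fixed weight `alpha`, the `penalty` coefficient, and the `margin` filter are all assumptions chosen for illustration, and both function names are hypothetical. In particular, the paper describes the fusion as adaptive, whereas `alpha` is simply passed in here.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def fused_score(qe_score: float, model_score: float, hallucination: float,
                alpha: float = 0.5, penalty: float = 1.0) -> float:
    """Hypothetical multi-perspective reward (an assumption, not the paper's formula).

    Blends an external QE score with the model's own score using weight
    `alpha`, then subtracts a penalty proportional to a hallucination signal.
    """
    return alpha * qe_score + (1.0 - alpha) * model_score - penalty * hallucination


def build_preference_pairs(candidates: Sequence[str], scores: Sequence[float],
                           margin: float = 0.0) -> List[Tuple[str, str]]:
    """Return every (preferred, rejected) pair from the candidate pool.

    Instead of keeping only the single best-vs-worst pair, all pairs whose
    score gap exceeds `margin` are kept (the margin is an assumed filter).
    """
    pairs = []
    for (i, s_i), (j, s_j) in combinations(enumerate(scores), 2):
        if abs(s_i - s_j) <= margin:
            continue  # skip near-ties, which carry little preference signal
        win, lose = (i, j) if s_i > s_j else (j, i)
        pairs.append((candidates[win], candidates[lose]))
    return pairs
```

With three candidates whose scores are all distinct, `build_preference_pairs` yields three pairs rather than one, which is the sense in which a multi-pair strategy uses more of the candidate pool than a single win-loss pair.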

Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals

Machine Learning (CS)

Teaches AI to follow many different rules better.

11 Aug 2025 0

91%

Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

Computation and Language

Teaches AI to understand more languages from English.

6 Mar 2025 1

91%

Robust Multi-Objective Preference Alignment with Online DPO

Computation and Language

Lets AI learn many different human wishes.

1 Mar 2025 1


Page Count

13 pages
