InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization
By: Yu Li, Tian Lan, Zhengling Qi
Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (the scalarization function and the reference policy), so the learned behavior reflects parameterization artifacts rather than true preferences. Second, generating each response in isolation fails to exploit the comparative information in pairwise data, leaving the model's capacity for intrinsic self-reflection untapped. To address these limitations, we propose Intrinsic Self-reflective Preference Optimization (InSPO), which derives a globally optimal policy that conditions on both the context and the alternative response. We prove this formulation is superior to DPO/RLHF while guaranteeing invariance to the choice of scalarization and reference policy. InSPO serves as a plug-and-play enhancement, requiring no architectural changes or inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.
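For reference, the standard DPO objective that the abstract contrasts against scores a preferred response y_w over a dispreferred response y_l given context x, relative to a reference policy pi_ref with temperature beta; the conditional form sketched after the equation is only an illustration of the "condition on the alternative response" idea, not the paper's exact InSPO objective:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

A policy with the self-reflective conditioning described above would instead score each response given its alternative, e.g. \(\pi_\theta(y_w \mid x, y_l)\) versus \(\pi_\theta(y_l \mid x, y_w)\); the precise objective and its optimality and invariance guarantees are derived in the paper.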
Similar Papers
Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection
Artificial Intelligence
Makes AI better by teaching it to fix its own mistakes.
Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences
Artificial Intelligence
Teaches AI to understand many different opinions.
Explicit Preference Optimization: No Need for an Implicit Reward Model
Machine Learning (CS)
Makes AI learn better without extra steps.