C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs
By: Xuan Feng, Bo An, Tianlong Gu, and more
Potential Business Impact:
Fixes AI's unfair thinking and keeps it smart.
Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). Prior paradigms, however, typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic study of these reasoning failures and identify a primary cause: latent spurious feature correlations in the input that drive erroneous reasoning shortcuts. Motivated by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework that tackles these failures by simultaneously discovering and suppressing such correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates both stereotypical and structural biases while preserving robust general reasoning capabilities.
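The abstract names two ingredients: counterfactual contrast signals and a fairness-sensitive preference update that weighs logit-level contributions. The paper's actual formulation is not given here, so the snippet below is only a minimal, assumption-laden sketch of how a counterfactual-weighted, DPO-style preference loss of this kind could look in PyTorch. The function and parameter names (c2po_style_loss, logit_shift, gamma) are hypothetical and do not come from the paper.

```python
# Illustrative sketch only, not the authors' implementation: a DPO-style
# preference loss where each pair is reweighted by how strongly the model's
# prediction shifts under a counterfactual prompt (e.g., a demographic term
# or surface pattern swapped), a crude proxy for shortcut reliance.
import torch
import torch.nn.functional as F


def c2po_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    logit_shift, beta=0.1, gamma=1.0):
    """DPO-style loss with a hypothetical per-example fairness weight.

    logit_shift: |log p(answer | original prompt) - log p(answer | counterfactual prompt)|,
    assumed to be precomputed per example.
    """
    # Standard DPO margin between policy and reference log-ratios.
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # Upweight examples whose predictions move most under the counterfactual,
    # i.e., those most likely to rely on a spurious shortcut feature.
    weight = 1.0 + gamma * torch.tanh(logit_shift)
    return (weight * -F.logsigmoid(margin)).mean()


# Toy usage with random values standing in for per-sequence log-probabilities.
torch.manual_seed(0)
n = 4
loss = c2po_style_loss(torch.randn(n), torch.randn(n),
                       torch.randn(n), torch.randn(n),
                       logit_shift=torch.rand(n))
print(loss)
```

In this sketch the counterfactual signal only scales the preference gradient; the paper's mechanism for isolating bias-inducing features from valid reasoning paths may differ substantially.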
Similar Papers
Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization
Computation and Language
Teaches computers new skills with less training data.
Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds
Computation and Language
Makes AI write fair stories, not biased ones.
When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
Computation and Language
Makes AI understand what you like better.