Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models
By: Sanskar Pandey, Ruhaan Chopra, Angkul Puniya and more
Potential Business Impact:
Makes AI tell the truth, not just agree.
Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independently of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.
Similar Papers
TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models
Computation and Language
Makes AI tell the truth, even when you argue.
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Computation and Language
Makes AI agree with you, even if wrong.