Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs
By: Daniel Vennemeyer, Phan Anh Duong, Tiffany Zhan and more
Potential Business Impact:
Makes AI agree with you less.
Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
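The abstract names two techniques, difference-in-means directions and activation additions. The following is a minimal sketch of both on synthetic data, not the authors' code. The random vectors stand in for model hidden states, the two behaviors are planted on orthogonal axes by construction, and every name here (`diff_in_means_direction`, `add_activation`, `axis_agree`, `axis_praise`, the dimensions and scales) is an assumption made for illustration.

```python
import numpy as np

def diff_in_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def add_activation(acts, direction, alpha):
    """Activation addition: shift every activation by alpha along a direction."""
    return acts + alpha * direction

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
dim, n = 64, 500

# Hypothetical setup: each behavior is planted on its own coordinate axis.
axis_agree = np.eye(dim)[0]
axis_praise = np.eye(dim)[1]

# Synthetic stand-ins for hidden states (assumption: isotropic Gaussian noise).
base = rng.normal(size=(n, dim))
agree_acts = rng.normal(size=(n, dim)) + 3.0 * axis_agree
praise_acts = rng.normal(size=(n, dim)) + 3.0 * axis_praise

# One difference-in-means direction per behavior, each contrasted with the baseline.
dir_agree = diff_in_means_direction(agree_acts, base)
dir_praise = diff_in_means_direction(praise_acts, base)

# Steer along the agreement direction only, then measure the mean shift
# in the projection onto each direction.
steered = add_activation(base, dir_agree, alpha=4.0)
shift_agree = float((steered @ dir_agree).mean() - (base @ dir_agree).mean())
shift_praise = float((steered @ dir_praise).mean() - (base @ dir_praise).mean())

print("cosine(agree, praise):", cosine(dir_agree, dir_praise))
print("shift along agree:", shift_agree)
print("shift along praise:", shift_praise)
```

In this toy setting the two recovered directions are close to orthogonal, so steering along one moves its own projection by `alpha` while leaving the other nearly unchanged. That mirrors the kind of independence the abstract reports, but only because the synthetic data were built that way; it says nothing about real model activations.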
Similar Papers
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Computation and Language
Makes AI agree with you, even if wrong.
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
Computation and Language
Fixes AI that agrees with you too much.
Sycophancy as compositions of Atomic Psychometric Traits
Artificial Intelligence
Makes AI less likely to agree with everything.