Sycophancy as compositions of Atomic Psychometric Traits
By: Shreyans Jain, Alexandra Yost, Amirali Abdullah
Potential Business Impact:
Makes AI less likely to agree with everything.
Sycophancy is a key behavioral risk in LLMs, yet it is often treated as an isolated failure mode that occurs via a single causal mechanism. We instead propose modeling it as geometric and causal compositions of psychometric traits such as emotionality, openness, and agreeableness, similar to factor decomposition in psychometrics. Using Contrastive Activation Addition (CAA), we map activation directions to these factors and study how different combinations may give rise to sycophancy (e.g., high extraversion combined with low conscientiousness). This perspective allows for interpretable and compositional vector-based interventions, such as addition, subtraction, and projection, which may be used to mitigate safety-critical behaviors in LLMs.
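The vector operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common mean-difference form of CAA (mean activation on trait-positive prompts minus mean activation on trait-negative prompts), and it uses synthetic random vectors in place of real model activations. The function names (caa_direction, project_out, fake_acts), the dimensionality, and the steering strength of 4.0 are all hypothetical choices made for illustration.

```python
import numpy as np

def caa_direction(pos_acts, neg_acts):
    """Mean-difference steering direction, normalized to unit length.

    pos_acts / neg_acts: arrays of shape (n_prompts, hidden_dim) holding
    activations for trait-positive and trait-negative prompts.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project_out(h, direction):
    """Remove the component of h that lies along `direction`."""
    d = direction / np.linalg.norm(direction)
    return h - np.dot(h, d) * d

rng = np.random.default_rng(0)
dim = 16  # hypothetical hidden size; real models are far larger

def fake_acts(axis, shift, n=200):
    """Synthetic stand-in for activations: small noise plus a shift along one axis."""
    return rng.normal(scale=0.1, size=(n, dim)) + shift * axis

# Assumption: each trait occupies its own axis in this toy space.
extraversion_axis = np.eye(dim)[0]
conscientiousness_axis = np.eye(dim)[1]

v_extra = caa_direction(fake_acts(extraversion_axis, 1.0),
                        fake_acts(extraversion_axis, -1.0))
v_consc = caa_direction(fake_acts(conscientiousness_axis, 1.0),
                        fake_acts(conscientiousness_axis, -1.0))

# Composition by addition and subtraction: "high extraversion, low conscientiousness".
composite = v_extra - v_consc
composite = composite / np.linalg.norm(composite)

h = rng.normal(size=dim)           # stand-in for one residual-stream activation
steered = h + 4.0 * composite      # addition: push toward the composite direction
ablated = project_out(h, composite)  # projection: remove the composite direction
```

In this toy setting, adding the composite vector raises the activation's component along that direction by the steering strength, while projecting it out leaves a component of approximately zero. Whether a given trait combination actually corresponds to sycophancy in a real model is the empirical question the paper studies; the sketch only shows the arithmetic.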
Similar Papers
Quantifying Sycophancy as Deviations from Bayesian Rationality in LLMs
Artificial Intelligence
Makes AI less likely to just agree with you.
Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs
Computation and Language
Makes AI agree with you less.
Invisible Saboteurs: Sycophantic LLMs Mislead Novices in Problem-Solving Tasks
Human-Computer Interaction
Makes AI less likely to agree with you wrongly.