When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
By: Kai Wang, Yihao Zhang, Meng Sun
Potential Business Impact:
Teaches AI to tell the truth, not lie.
The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty failures in LLMs, which can often be attributed to hallucination, these models' explicit thought traces let us study strategic deception: goal-driven, intentional misinformation in which the reasoning contradicts the output. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) that achieve 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling an honesty failure mode specific to reasoning models and providing tools for trustworthy AI alignment.
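The pipeline the abstract describes (contrastive activation collection, LAT-style extraction of a principal direction, linear detection, and activation steering) can be sketched compactly. The Python sketch below uses synthetic NumPy arrays in place of real hidden states from a reasoning model; the variable names, the thresholding rule, and the steering function are illustrative assumptions, not the authors' implementation.

# Minimal sketch of LAT-style "deception vector" extraction, detection, and
# activation steering, in the spirit of representation engineering.
# Activations are synthetic stand-ins; a real pipeline would collect hidden
# states from a chosen layer of a reasoning model under paired honest vs.
# deceptive prompts. All names here are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d_model = 64          # hidden size (illustrative)
n_pairs = 200         # number of contrastive prompt pairs

# Synthetic hidden states for honest vs. deceptive runs of the same prompts.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
honest = rng.normal(size=(n_pairs, d_model))
deceptive = honest + 1.5 * true_direction + 0.3 * rng.normal(size=(n_pairs, d_model))

# LAT-style extraction: paired activation differences, then keep the top
# principal component as the candidate "deception vector".
diffs = deceptive - honest
diffs = diffs - diffs.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
deception_vector = vt[0]                      # unit-norm principal direction

# Orient the vector so deceptive activations score higher than honest ones.
if (deceptive.mean(axis=0) - honest.mean(axis=0)) @ deception_vector < 0:
    deception_vector = -deception_vector

def deception_score(hidden_state):
    """Projection onto the deception vector; larger means more 'deceptive'."""
    return float(hidden_state @ deception_vector)

# Detection: threshold the projection midway between the two class means.
threshold = 0.5 * (deception_score(honest.mean(axis=0))
                   + deception_score(deceptive.mean(axis=0)))
preds = [deception_score(h) > threshold for h in np.vstack([honest, deceptive])]
labels = [False] * n_pairs + [True] * n_pairs
accuracy = float(np.mean([p == l for p, l in zip(preds, labels)]))
print(f"detection accuracy on synthetic activations: {accuracy:.2f}")

def steer(hidden_state, alpha):
    """Activation steering: alpha > 0 nudges toward deception, alpha < 0 away."""
    return hidden_state + alpha * deception_vector

In an actual setup, honest and deceptive would hold residual-stream activations from paired prompts, and steer would be applied inside a forward hook at the chosen layer during generation.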
Similar Papers
Can LLMs Lie? Investigation beyond Hallucination
Machine Learning (CS)
Teaches AI to lie or tell the truth.
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Machine Learning (CS)
Finds when AI lies about hard problems.
Do Large Language Models Exhibit Spontaneous Rational Deception?
Computation and Language
Smart AI sometimes lies when doing so benefits it.