The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence
By: Herun Wan, Jiaying Wu, Minnan Luo, and more
Potential Business Impact:
Makes AI believe fake facts, but a new fix helps.
To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.
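To make the described pipeline concrete, below is a minimal Python sketch of a multi-role, multi-round refinement loop (in the spirit of MisBelief) followed by a DIS-style intent check that flags evidence before it is taken at face value. The function names, prompts, and the `call_llm` interface are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: multi-role refinement of hard-to-falsify evidence, plus a
# deceptive-intent check used as an early warning before belief updating.
# All names and prompts below are assumptions for illustration only.

from typing import Callable

LLMFn = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def generate_misleading_evidence(call_llm: LLMFn, claim: str, rounds: int = 3) -> str:
    """Collaboratively refine persuasive but deceptive evidence over several rounds."""
    evidence = call_llm(f"Draft plausible-sounding supporting evidence for: {claim}")
    for _ in range(rounds):
        # Critic role: identify the points that are easiest to falsify.
        critique = call_llm(
            f"Claim: {claim}\nEvidence: {evidence}\n"
            "List the weakest, most easily refuted points in this evidence."
        )
        # Refiner role: revise the evidence so it stays logically persuasive.
        evidence = call_llm(
            f"Claim: {claim}\nEvidence: {evidence}\nCritique: {critique}\n"
            "Rewrite the evidence to address the critique while remaining persuasive."
        )
    return evidence


def deceptive_intent_shield(call_llm: LLMFn, evidence: str) -> bool:
    """Early-warning check: infer whether the evidence carries deceptive intent."""
    verdict = call_llm(
        f"Evidence: {evidence}\n"
        "Does this evidence appear crafted to mislead rather than inform? "
        "Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")


if __name__ == "__main__":
    # Stand-in LLM so the sketch runs without an API; swap in a real client.
    def fake_llm(prompt: str) -> str:
        return "YES" if "mislead" in prompt else "Plausible but fabricated evidence."

    evidence = generate_misleading_evidence(fake_llm, "The moon emits its own light.")
    if deceptive_intent_shield(fake_llm, evidence):
        print("Warning: evidence flagged as potentially deceptive; evaluate cautiously.")
```

In this reading, the shield does not attempt to verify the claim itself; it only surfaces an intent signal that prompts more cautious downstream evaluation, consistent with the abstract's description of DIS as a governance mechanism.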
Similar Papers
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Machine Learning (CS)
Finds when AI lies about hard problems.
Can LLMs Lie? Investigation beyond Hallucination
Machine Learning (CS)
Teaches AI to lie or tell the truth.
Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage
Computation and Language
Makes AI believe lies using only true facts.