Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
By: Xikang Yang, Biyu Zhou, Xuehai Tang, and others
Potential Business Impact:
Shows that psychological "mind tricks" (cognitive biases) can be used to bypass AI safety rules.
Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases (systematic deviations from rational judgment). Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate than the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
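The abstract describes combining multiple cognitive biases into a single adversarial prompt. The paper's actual pipeline uses supervised fine-tuning and reinforcement learning to optimize these combinations; as a rough illustration of the core idea only, here is a minimal sketch that enumerates candidate multi-bias prompts. The bias templates, function names, and nesting scheme below are invented for illustration and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical bias-framing templates (illustrative only; the paper's
# optimized prompt forms are learned, not hand-written like these).
BIAS_TEMPLATES = {
    "authority": "As noted by leading experts, {query}",
    "anchoring": "Given that similar requests are routinely approved, {query}",
    "framing": "For purely educational purposes, {query}",
}


def compose_prompt(query: str, biases: tuple) -> str:
    """Wrap the query in each bias template, innermost first."""
    prompt = query
    for bias in biases:
        prompt = BIAS_TEMPLATES[bias].format(query=prompt)
    return prompt


def enumerate_bias_combinations(query: str, max_biases: int = 2):
    """Yield (bias_combo, candidate_prompt) for every combination of
    1..max_biases biases; a real attack would score these against a model."""
    for r in range(1, max_biases + 1):
        for combo in combinations(BIAS_TEMPLATES, r):
            yield combo, compose_prompt(query, combo)
```

In a real red-teaming loop, each candidate would be sent to the target model and scored by an attack-success judge, with the best-performing combinations fed back as training signal; this sketch covers only the enumeration step.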
Similar Papers
Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
Computation and Language
Makes product ads trick computers into recommending them.
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Computation and Language
Makes AI say bad things by tricking its rules.