Adversarial Contrastive Learning for LLM Quantization Attacks
By: Dinghong Song, Zhiwei Xu, Hai Wan, and more
Potential Business Impact:
Shows that quantizing a benign LLM for deployment can activate hidden malicious behaviors such as jailbreaks, over-refusal, and advertisement injection.
Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed a severe security risk: benign full-precision LLMs may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between the probabilities of benign and harmful responses. ACL formulates the attack objective as a triplet-based contrastive loss and integrates it with projected gradient descent in a two-stage distributed fine-tuning strategy to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's remarkable effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, substantially outperforming state-of-the-art methods by up to 44.67%, 18.84%, and 50.80%, respectively.
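To illustrate the core idea, the minimal PyTorch sketch below shows one plausible form of a triplet-style contrastive loss that pushes the likelihood of a harmful response above that of a benign response, paired with a PGD-style projection that keeps the attacked weights close to the original full-precision weights. This is not the authors' implementation: the function names, the hinge form of the loss, the L-infinity projection, and the HuggingFace-style `.logits` interface are all assumptions made for illustration.

```python
# Hypothetical sketch, not the paper's code: triplet-style contrastive
# objective plus a PGD-style projection around the benign weights.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_mask):
    """Sum of token log-probabilities over the response span (assumed interface)."""
    logits = model(input_ids).logits[:, :-1, :]            # predictions for tokens 1..T
    targets = input_ids[:, 1:]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * response_mask[:, 1:]).sum(dim=-1)

def triplet_contrastive_loss(model, harmful_batch, benign_batch, margin=1.0):
    """
    Anchor: the prompt; positive: harmful response; negative: benign response.
    Minimizing this hinge loss raises log p(harmful | prompt) over
    log p(benign | prompt) by at least `margin` (assumed loss form).
    """
    lp_harm = sequence_logprob(model, *harmful_batch)
    lp_benign = sequence_logprob(model, *benign_batch)
    return F.relu(margin - (lp_harm - lp_benign)).mean()

@torch.no_grad()
def project_to_quantization_box(model, reference_params, radius):
    """
    PGD-style projection: clamp each weight to an L-infinity ball of size
    `radius` around the full-precision reference weights, a stand-in for the
    constraint that the attacked model stays in the same quantization cell
    as the benign model.
    """
    for p, p_ref in zip(model.parameters(), reference_params):
        p.copy_(torch.minimum(torch.maximum(p, p_ref - radius), p_ref + radius))
```

In a training loop, one would alternate a gradient step on `triplet_contrastive_loss` with a call to `project_to_quantization_box`, so that the full-precision model remains benign while its quantized counterpart inherits the harmful behavior.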
Similar Papers
AQUA-LLM: Evaluating Accuracy, Quantization, and Adversarial Robustness Trade-offs in LLMs for Cybersecurity Question Answering
Cryptography and Security
Evaluates accuracy, quantization, and adversarial-robustness trade-offs for LLMs used in cybersecurity question answering.
Critical Evaluation of Quantum Machine Learning for Adversarial Robustness
Cryptography and Security
Critically evaluates whether quantum machine learning improves adversarial robustness.