Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning
By: Zhiyuan Chang , Mingyang Li , Yuekai Huang and more
Potential Business Impact:
Blocks bad instructions from tricking AI.
Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation
Similar Papers
Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis
Cryptography and Security
Stops AI from following secret bad commands.
Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs
Cryptography and Security
Finds ways AI can be tricked.
Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks
Cryptography and Security
Stops tricky instructions from tricking AI.