Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks
By: Sizhe Chen , Arman Zharmagambetov , David Wagner and more
Potential Business Impact:
Makes AI safer from sneaky tricks.
Prompt injection attacks pose a significant security threat to LLM-integrated applications. Model-level defenses have shown strong effectiveness, but are currently deployed into commercial-grade models in a closed-source manner. We believe open-source models are needed by the AI security community, where co-development of attacks and defenses through open research drives scientific progress in mitigation against prompt injection attacks. To this end, we develop Meta SecAlign, the first open-source and open-weight LLM with built-in model-level defense that achieves commercial-grade model performance. We provide complete details of our training recipe, which utilizes an improved version of the SOTA SecAlign defense. Evaluations on 9 utility benchmarks and 7 security benchmarks show that Meta SecAlign, despite being trained on a generic instruction-tuning dataset, confers security in unseen downstream tasks, including tool-calling and agentic web navigation, in addition general instruction-following. Our best model -- Meta-SecAlign-70B -- achieves state-of-the-art robustness against prompt injection attacks and comparable utility to closed-source commercial LLM with model-level defense.
Similar Papers
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
Cryptography and Security
Protects smart computer helpers from bad instructions.
Lifelong Safety Alignment for Language Models
Cryptography and Security
Teaches AI to block new tricks to trick it.
Beyond the Benchmark: Innovative Defenses Against Prompt Injection Attacks
Cryptography and Security
Stops tricky instructions from tricking AI.