Score: 0

Security Steerability is All You Need

Published: April 28, 2025 | arXiv ID: 2504.19521v4

By: Itay Hazan , Idan Habler , Ron Bitton and more

Potential Business Impact:

Makes AI follow rules to stop bad questions.

Business Areas:

Intelligent Systems Artificial Intelligence, Data and Analytics, Science and Engineering

The adoption of Generative AI (GenAI) in applications inevitably comes with the expansion of the attack surface, combining new security threats along with the traditional ones. Consequently, numerous research and industrial initiatives aim to mitigate the GenAI related security threats by developing evaluation methods and designing defenses. However, while most of the GenAI security work focuses on universal threats (e.g. 'How to build a bomb'), there is significantly less discussion on application-level security and how to evaluate and mitigate it. Thus, in this work we adopt an application-centric approach to GenAI security, and show that while LLMs cannot protect against ad-hoc application specific threats, they can provide the framework for applications to protect themselves against such threats. Our first contribution is defining Security Steerability - a novel security measure for LLMs, assessing the model's capability to adhere to strict guardrails that are defined in the system prompt (e.g. 'Refrain from discussing about our competitors'). These guardrails, in case effective, can stop threats in the presence of malicious users who attempt to circumvent the application purpose. Our second contribution is a methodology to measure the security steerability of LLMs, utilizing a newly-developed benchmark called VeganRibs which assesses the LLM behavior in forcing specific guardrails that are not security per-se, in the presence of malicious user that tries to bypass the guardrails through prompt injection attacks with attack boosters (jailbreaks and perturbations). Using the new benchmark, we analyzed 18 open-source LLMs, demonstrating significant differences between their security steerability that are not trivial to foresee...