Score: 0

RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Published: October 6, 2025 | arXiv ID: 2510.05310v1

By: Yining She , Daniel W. Peterson , Marianne Menglin Liu and more

Potential Business Impact:

Makes AI safety checks unreliable with extra information.

Business Areas:

Semantic Search Internet Services

With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval Augmentation Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested only bring minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.

Adapting Large Language Models to Emerging Cybersecurity using Retrieval Augmented Generation

Cryptography and Security

Helps computers spot new cyber threats faster.

31 Oct 2025 0

93%

Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs

Cryptography and Security

Makes AI systems ignore good questions by tricking safety rules.

30 Apr 2025 0

92%

RAGuard: A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs

Artificial Intelligence

Keeps wind turbines safe and working right.

3 Sep 2025 0

View PDF Login to Bookmark

Page Count

19 pages

RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Makes AI safety checks unreliable with extra information.

Technical Abstract

Adapting Large Language Models to Emerging Cybersecurity using Retrieval Augmented Generation

Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs

RAGuard: A Novel Approach for in-context Safe Retrieval Augmented Generation for LLMs