Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs
By: Aayush Gupta
Potential Business Impact:
Stops AI models from being tricked by malicious instructions hidden in their inputs.
Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains a 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; the current, non-optimized data path does incur some latency overhead. Because CIV is a lightweight patch that requires no fine-tuning, we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
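The central enforcement mechanism described above is a pre-softmax hard attention mask keyed to per-token trust. The sketch below illustrates that idea in PyTorch; the function names, integer trust encoding, and tensor shapes are assumptions for exposition only, not the paper's reference implementation, which additionally verifies cryptographically signed provenance labels and offers optional FFN/residual gating.

```python
# Illustrative sketch of a trust-lattice attention mask (assumed API, not CIV's reference code).
import torch
import torch.nn.functional as F

def trust_attention_mask(trust_levels: torch.Tensor) -> torch.Tensor:
    """Build an additive pre-softmax mask from per-token trust levels.

    trust_levels: (seq_len,) integer tensor, higher value = higher trust
                  (e.g. 3 = system, 2 = user, 1 = tool output, 0 = web text).
    Returns a (seq_len, seq_len) float mask that is 0 where attention is
    allowed and -inf where a higher-trust query would attend to a
    lower-trust key, so lower-trust tokens cannot influence
    higher-trust representations.
    """
    q_trust = trust_levels.unsqueeze(1)   # (seq_len, 1): trust of each query token
    k_trust = trust_levels.unsqueeze(0)   # (1, seq_len): trust of each key token
    allowed = k_trust >= q_trust          # a key must be at least as trusted as its query
    mask = torch.zeros_like(allowed, dtype=torch.float32)
    mask[~allowed] = float("-inf")
    return mask

def masked_attention(q, k, v, trust_levels):
    """Scaled dot-product attention with the trust mask added before softmax."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores + trust_attention_mask(trust_levels)
    return F.softmax(scores, dim=-1) @ v

# Example: system prompt (trust 3), user turn (trust 2), retrieved web text (trust 0).
trust = torch.tensor([3, 3, 2, 2, 0, 0])
q = k = v = torch.randn(6, 16)
out = masked_attention(q, k, v, trust)  # rows 0-3 receive no weight from the web-text tokens
```

Because the -inf entries zero out the corresponding softmax weights, a query token never receives probability mass from a key of strictly lower trust, which is the per-token non-interference property the abstract claims; in a real deployment this mask would be combined with the model's usual causal mask.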
Similar Papers
Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
Artificial Intelligence
Teaches AI what private info to share.
Position: Contextual Integrity is Inadequately Applied to Language Models
Computers and Society
Makes AI share information more safely and fairly.