On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment
By: Sarah Ball, Greg Gluch, Shafi Goldwasser, et al.
Potential Business Impact:
Shows why external safety filters cannot make AI reliably safe.
With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.
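To make the prompt-filtering barrier concrete, here is a minimal toy sketch in Python. It is not the paper's construction, only an illustration in its spirit: an adversarial prompt is an encryption of harmful instructions under a key known to the model (conceptually, baked into its weights) but not to the filter, so to any efficient key-less filter the prompt is computationally indistinguishable from benign random input. All names here (toy_encrypt, efficient_filter, and so on) are illustrative assumptions, not the paper's notation.

import hashlib
import os
import secrets

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudorandom bytes from key via iterated SHA-256 (a toy PRG)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """XOR the plaintext with a keyed pseudorandom stream (a toy stream cipher)."""
    return bytes(a ^ b for a, b in zip(plaintext, keystream(key, len(plaintext))))

# The "model" holds the key internally; the filter does not.
key = secrets.token_bytes(32)

harmful = b"step-by-step harmful instructions ..."
benign = os.urandom(len(harmful))  # a contentless random prompt of equal length

adversarial_prompt = toy_encrypt(key, harmful)

def efficient_filter(prompt: bytes) -> bool:
    """Stand-in for any efficient key-less test: without the key, both prompts
    look like uniformly random bytes, so every statistic it computes is
    distributed (essentially) identically on the two inputs."""
    return sum(prompt) % 2 == 0

print(efficient_filter(adversarial_prompt), efficient_filter(benign))

Because recovering the harmful content requires the key held inside the model, any filter restricted to the prompt itself (or to black-box access) cannot flag the adversarial input without also flagging benign random inputs, which mirrors the abstract's conclusion that filters external to the LLM internals will not suffice.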
Similar Papers
Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System
Computation and Language
Protects AI from bad instructions without retraining.
The Alignment Trap: Complexity Barriers
Artificial Intelligence
Shows why AI safety cannot be formally guaranteed.
Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Machine Learning (CS)
Keeps AI from learning dangerous secrets.