Semantic Gravity Wells: Why Negative Constraints Backfire
By: Shailesh Rana
Negative constraints (instructions of the form "do not use word X") represent a fundamental test of instruction-following capability in large language models. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model's intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure ($p=\sigma(-2.40+2.27\cdot P_0)$; $n=40{,}000$ samples; bootstrap $95\%$ CI for slope: $[2.21,\,2.33]$). Through layer-wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but systematically weaker in failures: the instruction reduces target probability by only 5.2 percentage points in failures versus 22.8 points in successes -- a $4.4\times$ asymmetry. We trace this asymmetry to two mechanistically distinct failure modes. In priming failure (87.5% of violations), the instruction's explicit mention of the forbidden word paradoxically activates rather than suppresses the target representation. In override failure (12.5%), late-layer feed-forward networks generate contributions of $+0.39$ toward the target probability -- nearly $4\times$ larger than in successes -- overwhelming earlier suppression signals. Activation patching confirms that layers 23--27 are causally responsible: replacing these layers' activations flips the sign of constraint effects. These findings reveal a fundamental tension in negative constraint design: the very act of naming a forbidden word primes the model to produce it.
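To make the quantitative setup concrete, here is a minimal sketch of how semantic pressure and the fitted logistic relationship from the abstract could be computed. It assumes a HuggingFace causal language model; the model name, prompt, and helper functions are illustrative assumptions, not the paper's code, and the forbidden word is approximated by a single token for simplicity.

```python
# Minimal sketch (not the paper's released code): estimate "semantic pressure"
# P0 as the model's baseline probability of emitting the forbidden token, then
# apply the fitted logistic model p = sigma(-2.40 + 2.27 * P0) from the abstract.
# The model name, prompt, and function names below are illustrative assumptions.

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the abstract does not name the studied model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def semantic_pressure(prompt: str, forbidden_word: str) -> float:
    """Next-token probability the unconstrained model assigns to the forbidden
    word, i.e. its intrinsic pull toward producing it."""
    target_id = tokenizer.encode(" " + forbidden_word)[0]  # leading space for BPE vocab
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    return torch.softmax(next_token_logits, dim=-1)[target_id].item()


def predicted_violation_probability(p0: float) -> float:
    """Logistic fit reported in the abstract: p = sigma(-2.40 + 2.27 * P0)."""
    return 1.0 / (1.0 + math.exp(-(-2.40 + 2.27 * p0)))


if __name__ == "__main__":
    prompt = "The sky today is a brilliant shade of"  # illustrative prompt
    p0 = semantic_pressure(prompt, "blue")
    print(f"P0 = {p0:.3f} -> predicted violation probability = "
          f"{predicted_violation_probability(p0):.3f}")
```

Treating the forbidden word as its first subword token is a simplification; a fuller treatment would sum probability over all tokenizations that begin the word.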
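The layer-wise suppression analysis can be sketched in the same spirit: project each intermediate hidden state through the final layer norm and unembedding (the logit lens) and compare the forbidden token's probability with and without the negative instruction. Again, the model, prompts, and GPT-2-style attribute paths (`transformer.ln_f`, `lm_head`) are assumptions for illustration rather than the paper's setup.

```python
# Logit-lens sketch (an assumed setup, not the paper's code): track the forbidden
# token's probability layer by layer, with and without the negative instruction,
# to measure the suppression signal the abstract describes.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model with GPT-2-style module names
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def layerwise_target_probs(prompt: str, forbidden_word: str) -> list[float]:
    """Per-layer probability of the forbidden token at the last position,
    obtained by applying the final layer norm and unembedding to each block's
    output (the standard logit-lens simplification)."""
    target_id = tokenizer.encode(" " + forbidden_word)[0]
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs, output_hidden_states=True).hidden_states
    probs = []
    for h in hidden_states[1:]:  # skip the embedding layer
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
        probs.append(torch.softmax(logits, dim=-1)[target_id].item())
    return probs


if __name__ == "__main__":
    base = layerwise_target_probs(
        "The sky today is a brilliant shade of", "blue")
    constrained = layerwise_target_probs(
        'Do not use the word "blue". The sky today is a brilliant shade of', "blue")
    # Final-layer suppression in percentage points (the abstract reports ~22.8
    # points in successes versus ~5.2 points in failures).
    print(f"suppression at final layer: "
          f"{100 * (base[-1] - constrained[-1]):.1f} points")
```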