Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load

Published: November 15, 2025 | arXiv ID: 2511.12381v1

By: Logan Mann, Nayan Saxena, Sarah Tandon, et al.

Potential Business Impact:

Explains why AI models blurt out the very things they are told not to mention, and identifies conditions (such as repetition) that help them actually suppress those concepts.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Negation instructions such as "do not mention X" can paradoxically increase the accessibility of X in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigated this tension with two experiments. (1) Load & content: after a negation instruction, we vary distractor text (semantic, syntactic, repetition) and measure rebound strength. (2) Polarity separation: we test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit-tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress them, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.
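The abstract's setup can be sketched in code. The following is a minimal illustration, not the paper's actual method: `rebound_score` is a hypothetical metric (log-ratio of the forbidden token's probability after a negation instruction versus a neutral baseline), and `make_negation_prompt` is a hypothetical helper imitating the "negation instruction + distractor" prompt structure the experiments describe. All names here are assumptions for illustration.

```python
import math

def rebound_score(p_after_negation: float, p_baseline: float) -> float:
    """Hypothetical rebound metric (not from the paper): log-ratio of the
    forbidden token's probability after a 'do not mention X' instruction
    versus a neutral baseline prompt. Positive values indicate ironic
    rebound (the forbidden concept became *more* likely, not less)."""
    return math.log(p_after_negation / p_baseline)

def make_negation_prompt(concept: str, distractor: str = "") -> str:
    """Illustrative ReboundBench-style prompt: a negation instruction,
    optionally followed by distractor text (semantic, syntactic, or
    repetition) inserted before the model continues generating."""
    return f"Do not mention {concept}. {distractor}".strip()

# Example: the forbidden token is four times more likely after negation,
# so the score is positive -- rebound rather than suppression.
score = rebound_score(0.04, 0.01)
prompt = make_negation_prompt("the white bear", "Describe a walk in the park.")
```

In a real experiment the two probabilities would come from a model's next-token distribution under the negation prompt and a matched neutral prompt; varying the distractor argument would then probe the load-and-content manipulation described above.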

Country of Origin
πŸ‡ΊπŸ‡Έ United States

Repos / Data Links

Page Count
12 pages

Category
Computer Science:
Computation and Language