Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against
By: Tsogt-Ochir Enkhbayar
Potential Business Impact:
AI models still copy bad examples, even when warned.
Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") turns out not to teach language models to avoid the warned-against behavior. In the experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, which I call "stealth slip", lets conversational preambles rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates pragmatic interpretation in current architectures: models learn what tends to follow a context, not why it appeared there.
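To make the orthogonalization failure concrete, here is a minimal sketch of the kind of check described above: comparing one SAE feature's activation magnitude under a warning-framed prompt versus a direct prompt for the same content. Everything in it is an illustrative assumption, not the paper's setup; the dimensions, the ReLU encoder form, and the random "residual-stream activations" are placeholders, and in a real replication those activations would be captured from the model at a fixed layer and token position, then passed through the trained sparse autoencoder.

import torch

torch.manual_seed(0)

D_MODEL, N_FEATURES = 768, 16384   # hypothetical residual-stream width / SAE dictionary size
FEATURE_ID = 8684                  # the "code execution" feature discussed above

# Stand-in for a trained sparse autoencoder encoder: f(x) = ReLU(x @ W_enc + b_enc)
W_enc = torch.randn(D_MODEL, N_FEATURES) / D_MODEL ** 0.5
b_enc = torch.zeros(N_FEATURES)

def sae_features(resid):
    """Encode residual-stream activations into sparse SAE feature activations."""
    return torch.relu(resid @ W_enc + b_enc)

# Stand-ins for activations captured at the same layer and token positions
# under two framings of the same vulnerable snippet:
#   (1) warning-framed: "DO NOT USE - this code is vulnerable: ..."
#   (2) direct:         "Here is the code: ..."
resid_warning = torch.randn(32, D_MODEL)   # 32 sampled token positions
resid_direct = torch.randn(32, D_MODEL)

act_warning = sae_features(resid_warning)[:, FEATURE_ID].mean().item()
act_direct = sae_features(resid_direct)[:, FEATURE_ID].mean().item()

# Comparable magnitudes mean this SAE feature does not separate
# "describing X" from "performing X" for the concept it tracks.
print(f"feature {FEATURE_ID} mean activation, warning context: {act_warning:.3f}")
print(f"feature {FEATURE_ID} mean activation, direct context:  {act_direct:.3f}")
print(f"warning/direct ratio: {act_warning / (act_direct + 1e-8):.2f} (near 1.0 = overlapping representation)")

In this framing, the training-time remedy the abstract points to would amount to ablating (zeroing) the offending feature's contribution during fine-tuning, rather than trying to steer it away at inference time.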
Similar Papers
Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models
Machine Learning (CS)
Makes AI image makers safer from bad prompts.
Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning
Machine Learning (CS)
Teaches AI to forget dangerous knowledge.
Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load
Computation and Language
Makes AI forget things it's told not to.