Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Published: January 15, 2026 | arXiv ID: 2601.10160v1

By: Cameron Tice, Puria Radmard, Samuel Ratnam and more

Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai.

Model Misalignment and Language Change: Traces of AI-Associated Language in Unscripted Spoken English

Computation and Language

Helps people talk more like computers do.

1 Aug 2025 1

88%

Misaligned from Within: Large Language Models Reproduce Our Double-Loop Learning Blindness

Human-Computer Interaction

AI learns our bad habits, hindering progress.

3 Jul 2025 0

88%

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Computation and Language

Makes AI models say bad things when tricked.

6 Aug 2025 2
