
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

Published: March 17, 2025 | arXiv ID: 2503.12734v2

By: Jianliang He, Xintian Pan, Siyu Chen and more

Potential Business Impact:

Teaches computers to learn from examples quickly.

Business Areas:

Semantic Search, Internet Services

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query (KQ) weights, and a last-entry-only and zero-sum pattern in the output-value (OV) weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor -- one that outperforms single-head attention and nearly achieves Bayesian optimality up to a proportional factor. Furthermore, compared to linear transformers, the softmax attention readily generalizes to sequences longer than those seen during training. We also extend our study to scenarios with anisotropic covariates and multi-task linear regression. In the former, multi-head attention learns to implement a form of pre-conditioned gradient descent. In the latter, we uncover an intriguing regime where the interplay between head number and task number triggers a superposition phenomenon that efficiently resolves multi-task in-context learning. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.
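The patterns named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes a hand-built two-head model in which the KQ weights are a scaled identity on the covariates (+omega in one head, -omega in the other) and the OV weights read only the label entry with opposite signs (+mu and -mu, so they sum to zero). The function names, omega, mu, and the centered one-step predictor used for comparison are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def two_head_softmax_predict(X, y, x_q, omega=1e-3, mu=None):
    """Hypothetical two-head softmax attention predictor.

    KQ: diagonal and homogeneous, i.e. scores are +/- omega * <x_i, x_q>.
    OV: only the label entry y_i is read, with head weights +mu and -mu
        (zero-sum). The default mu = 1 / (2 * omega) is an assumed scaling.
    """
    if mu is None:
        mu = 1.0 / (2.0 * omega)
    scores = X @ x_q                      # <x_i, x_q> for each context example
    p_pos = softmax(omega * scores)       # head with positive KQ scale
    p_neg = softmax(-omega * scores)      # head with negative KQ scale
    return mu * (p_pos @ y) - mu * (p_neg @ y)

def debiased_gd_predict(X, y, x_q):
    """Assumed reference: one gradient-style step with centered scores."""
    scores = X @ x_q
    return np.mean((scores - scores.mean()) * y)

# Synthetic in-context linear regression task (noise-free, isotropic covariates).
rng = np.random.default_rng(0)
d, n = 5, 200
beta = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta
x_q = rng.normal(size=d)

attn = two_head_softmax_predict(X, y, x_q)
gd = debiased_gd_predict(X, y, x_q)
print(f"two-head softmax: {attn:.4f}  centered GD step: {gd:.4f}  truth: {beta @ x_q:.4f}")
```

For small omega, the softmax weights of the two heads differ to first order by 2 * omega * (s_i - mean(s)) / n, where s_i = <x_i, x_q>, so the two printed predictors should nearly coincide. This is only a small-omega illustration of why a zero-sum pair of heads can act like a centered ("debiased") gradient step. The trained parameter values and the exact predictor analyzed in the paper may differ.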

Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Machine Learning (CS)

Makes AI learn better with longer instructions.

12 Dec 2025 0

87%

In-Context Algorithm Emulation in Fixed-Weight Transformers

Machine Learning (CS)

Computers learn new tricks from just instructions.

24 Aug 2025 2

87%

Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network Sparsity

Numerical Analysis

Makes AI understand itself better for new uses.

18 Jun 2025 1


Country of Origin

🇨🇳 🇺🇸 China, United States

Repos / Data Links

github.com

Page Count

79 pages
