The Mean-Field Dynamics of Transformers
By: Philippe Rigollet
Potential Business Impact:
Makes AI understand long texts better by grouping ideas.
We develop a mathematical framework that interprets Transformer attention as an interacting particle system and study its continuum (mean-field) limits. By idealizing attention as continuous-time dynamics on the unit sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon: tokens asymptotically collapse to a single cluster, typically after long metastable phases during which they remain arranged in multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.
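The interacting-particle view of attention can be made concrete with a short simulation. The following is a minimal sketch, assuming the simplified self-attention dynamics in which each token lives on the unit sphere, moves toward a softmax-weighted average of all tokens, and is projected back onto the sphere; the function names and parameters (beta, dt, steps) are illustrative and not taken from the paper.

```python
import numpy as np

def project_tangent(x, v):
    """Project v onto the tangent space of the unit sphere at x."""
    return v - np.dot(x, v) * x

def attention_dynamics(X, beta=4.0, dt=0.05, steps=2000):
    """Evolve tokens on the sphere under simplified self-attention dynamics.

    X: (n, d) array of unit-norm token embeddings.
    Each token drifts toward the softmax-weighted average of all tokens
    (an interacting particle system) and is retracted back onto the sphere.
    """
    X = X.copy()
    n, d = X.shape
    for _ in range(steps):
        # Attention weights: softmax of pairwise inner products, scaled by beta.
        logits = beta * X @ X.T
        weights = np.exp(logits - logits.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        drift = weights @ X  # softmax-weighted token averages
        for i in range(n):
            X[i] += dt * project_tangent(X[i], drift[i])
            X[i] /= np.linalg.norm(X[i])  # stay on the unit sphere
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X0 = rng.normal(size=(32, 3))
    X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
    Xt = attention_dynamics(X0)
    # Pairwise cosine similarities near 1 indicate the tokens have clustered.
    print("min pairwise cosine similarity:", (Xt @ Xt.T).min())
```

Under these illustrative settings the tokens typically contract toward a single point on the sphere, which mirrors the asymptotic clustering behavior described above; intermediate snapshots of the simulation would show the multi-cluster metastable phases.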
Similar Papers
Quantitative Clustering in Mean-Field Transformer Models
Machine Learning (CS)
Makes AI learn faster by grouping ideas.
A Mathematical Explanation of Transformers for Large Language Models and GPTs
Machine Learning (CS)
Explains how AI learns by seeing patterns.
The Curved Spacetime of Transformer Architectures
Machine Learning (CS)
Makes AI understand words by bending their meanings.