Predicting the Formation of Induction Heads
By: Tatsuya Aoyama, Ethan Gotlieb Wilcox, Nathan Schneider
Potential Business Impact:
Predicts when AI models become able to learn new tasks from examples.
Arguably, specialized attention heads dubbed induction heads (IHs) underlie the remarkable in-context learning (ICL) capabilities of modern language models (LMs); yet a precise characterization of their formation remains elusive. In this study, we investigate the relationship between statistical properties of the training data, both natural and synthetic, and IH formation. We show that (1) a simple equation combining batch size and context size predicts the point at which IHs form; (2) surface bigram repetition frequency and reliability strongly affect IH formation, and we find a precise Pareto frontier over these two values; and (3) local dependency with high bigram repetition frequency and reliability is sufficient for IH formation, whereas when frequency and reliability are low, categoriality and the shape of the marginal distribution matter.
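To make findings (1) and (2) concrete, here is a minimal Python sketch, not the authors' implementation. It assumes (a) that the quantity behind "a simple equation combining batch size and context size" is the cumulative number of training tokens, step × batch size × context size (the exact equation is in the paper), and (b) hypothetical working definitions of bigram repetition frequency and reliability; all function names and definitions below are illustrative assumptions.

def tokens_seen(step, batch_size, context_size):
    # Cumulative training tokens after `step` optimizer steps. This product
    # is an assumed form of the paper's batch-size/context-size predictor,
    # shown for illustration only.
    return step * batch_size * context_size

def bigram_repetition_stats(context):
    # Hypothetical working definitions (assumptions, not the paper's):
    #   frequency:   fraction of positions whose token already occurred
    #                earlier in the context, so its bigram can repeat;
    #   reliability: among those repeats, how often the next token matches
    #                the token that followed the earlier occurrence.
    last_next = {}   # token -> token that followed its most recent occurrence
    repeats = matches = 0
    for i in range(len(context) - 1):
        tok, nxt = context[i], context[i + 1]
        if tok in last_next:
            repeats += 1
            matches += int(last_next[tok] == nxt)
        last_next[tok] = nxt
    n_bigrams = len(context) - 1
    frequency = repeats / n_bigrams if n_bigrams else 0.0
    reliability = matches / repeats if repeats else 0.0
    return frequency, reliability

# Example: one reliable repeat ("the" -> "cat" twice) and one unreliable
# repeat ("cat" -> "sat" vs. "cat" -> "mat").
ctx = "the cat sat on the cat mat".split()
print(bigram_repetition_stats(ctx))                       # (0.33..., 0.5)
print(tokens_seen(step=1000, batch_size=64, context_size=512))

On this reading, finding (2)'s Pareto frontier would be traced by sweeping synthetic data generators over (frequency, reliability) pairs and checking in which regions IHs form.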
Similar Papers
On the Emergence of Induction Heads for In-Context Learning
Artificial Intelligence
Helps computers learn new things from examples.
The Initialization Determines Whether In-Context Learning Is Gradient Descent
Machine Learning (CS)
Shows how a model's starting weights shape how it learns from examples.
How Does the Pretraining Distribution Shape In-Context Learning? Task Selection, Generalization, and Robustness
Machine Learning (CS)
Shows how training data shapes what computers learn from examples.