On Vanishing Variance in Transformer Length Generalization
By: Ruining Li, Gabrijel Boduljak, Jensen, and more
Potential Business Impact:
Helps AI models handle inputs longer than those they were trained on.
It is a widely known issue that Transformers trained on shorter sequences fail to generalize robustly to longer ones at test time. This calls into question whether Transformer models are genuine reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today's frontier models, longer sequences lead to lower variance in the outputs of the multi-head attention modules. On the argmax retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction, though not a complete elimination, of the distribution shift caused by vanishing variance.
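The two effects described in the abstract, attention-output variance shrinking as sequences grow and a post-attention layer normalization counteracting the resulting distribution shift, can be illustrated with a small PyTorch sketch. This is not the authors' code: the model width, head count, sequence lengths, and the use of nn.MultiheadAttention with randomly initialized weights and random inputs are all illustrative assumptions.

# Minimal sketch (illustrative, not the paper's implementation): measure how the
# per-token variance of a multi-head attention output changes with sequence length,
# and how a LayerNorm applied to that output restores near-unit variance.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, n_heads = 64, 4                               # illustrative sizes
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
post_ln = nn.LayerNorm(d_model)                        # layer norm applied after attention outputs

for seq_len in (16, 64, 256, 1024):
    x = torch.randn(8, seq_len, d_model)               # batch of random token embeddings
    out, _ = attn(x, x, x, need_weights=False)         # self-attention output
    var_raw = out.var(dim=-1).mean().item()            # feature-wise variance, averaged over tokens
    var_ln = post_ln(out).var(dim=-1).mean().item()
    print(f"len={seq_len:5d}  attn-output var={var_raw:.4f}  after LayerNorm var={var_ln:.4f}")

In this sketch the randomly initialized attention pattern is close to uniform, so each output token is roughly an average over the sequence and its variance falls as the sequence grows, while the normalized output stays near unit variance at every length. This is only meant to make the abstract's claim concrete at initialization; the paper's measurements concern trained frontier models.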
Similar Papers
Quantitative Bounds for Length Generalization in Transformers
Machine Learning (CS)
Makes AI understand longer text by training it more.
Extrapolation by Association: Length Generalization Transfer in Transformers
Computation and Language
Helps computers learn longer tasks from similar ones.
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
Machine Learning (CS)
AI learns to solve harder problems with longer thinking.