Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries
By: Divyat Mahajan, Sachin Goyal, Badr Youbi Idrissi, and more
Potential Business Impact:
Helps AI plan further ahead, improving long-form writing, reasoning, and coding.
Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example a bag-of-words summary of the sequence's future, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B- and 8B-parameter models) demonstrate that FSP improves over both NTP and MTP across math, reasoning, and coding benchmarks.
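To make the idea concrete, the sketch below illustrates the handcrafted variant described in the abstract: an auxiliary head on top of a causal LM's hidden states is trained to predict a bag-of-words (multi-hot) summary of the next few future tokens, and that loss is added to the standard next-token loss. This is a minimal sketch assuming a PyTorch setup; the module names, window size, and loss weight are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of future summary prediction (FSP), bag-of-words variant.
# All names, the window size, and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def bag_of_words_targets(tokens: torch.Tensor, vocab_size: int, window: int) -> torch.Tensor:
    """For each position t, build a multi-hot vector over the next `window` tokens.

    tokens: (batch, seq_len) integer token ids.
    Returns: (batch, seq_len, vocab_size) float multi-hot targets.
    """
    batch, seq_len = tokens.shape
    targets = torch.zeros(batch, seq_len, vocab_size)
    for t in range(seq_len):
        future = tokens[:, t + 1 : t + 1 + window]        # tokens in the future window
        if future.numel() == 0:
            continue                                       # no future left at the last position
        targets[:, t].scatter_(1, future, 1.0)             # mark each future token as present
    return targets


class FSPHead(nn.Module):
    """Auxiliary head mapping a hidden state to a predicted summary of the future."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, vocab_size)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)                            # logits over the vocabulary


def fsp_training_loss(hidden, ntp_logits, tokens, fsp_head, vocab_size,
                      window=64, fsp_weight=0.1):
    """Standard next-token loss plus the auxiliary future-summary loss."""
    # Next-token prediction: predict token t+1 from position t.
    ntp_loss = F.cross_entropy(
        ntp_logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
    )
    # Auxiliary objective: predict a bag-of-words summary of the future window.
    summary_targets = bag_of_words_targets(tokens, vocab_size, window)
    summary_logits = fsp_head(hidden)
    fsp_loss = F.binary_cross_entropy_with_logits(summary_logits, summary_targets)
    return ntp_loss + fsp_weight * fsp_loss


# Toy usage with random tensors standing in for a causal LM's outputs.
if __name__ == "__main__":
    batch, seq_len, d_model, vocab = 2, 16, 32, 100
    tokens = torch.randint(0, vocab, (batch, seq_len))
    hidden = torch.randn(batch, seq_len, d_model)           # trunk hidden states
    ntp_logits = torch.randn(batch, seq_len, vocab)         # main LM head logits
    head = FSPHead(d_model, vocab)
    loss = fsp_training_loss(hidden, ntp_logits, tokens, head, vocab, window=8)
    print(loss.item())
```

The learned-summary variant mentioned in the abstract would replace the bag-of-words target with an embedding produced by a separately trained right-to-left (reverse) language model, with the auxiliary head regressing onto that embedding instead; the auxiliary head is only used during pretraining, so inference cost is unchanged.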
Similar Papers
Predicting the Order of Upcoming Tokens Improves Language Modeling
Machine Learning (CS)
Teaches computers to guess words better.
Context-level Language Modeling by Learning Predictive Context Embeddings
Computation and Language
Makes AI understand stories better, not just words.
Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap
Computation and Language
Teaches AI to learn faster and better.