Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining
By: Ruizhe Huang, Kexuan Zhang, Yihao Fang, et al.
This study investigates small-scale pretraining for Small Language Models (SLMs) to enable efficient use of limited data and compute, improve accessibility in low-resource settings, and reduce costs. To enhance long-context extrapolation in compact models, we focus on Infini-attention, which builds a compressive memory from past segments while preserving local attention. We conduct an empirical study using 300M-parameter LLaMA models pretrained with Infini-attention. The model trains stably and outperforms the baseline on long-context retrieval. We identify the balance factor as a key determinant of model performance, and we find that retrieval accuracy drops as the memory is repeatedly compressed over long sequences. Even so, Infini-attention effectively compensates for the SLM's limited parameters: despite performance degradation at a 16,384-token context, the Infini-attention model achieves up to 31% higher accuracy than the baseline. Our findings suggest that robust long-context capability in SLMs benefits from architectural memory mechanisms such as Infini-attention.
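To make the mechanism concrete, below is a minimal single-head NumPy sketch of the segment-level computation the abstract describes: a linear-attention compressive memory retrieved and updated once per segment, standard causal attention within the segment, and a scalar balance factor gating the two streams. This follows the standard Infini-attention formulation (ELU+1 feature map, additive memory update, sigmoid gate); the function names, shapes, and the epsilon constant are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def elu_plus_one(x):
    # Feature map sigma(x) = ELU(x) + 1; keeps mapped queries/keys positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_segment(Q, K, V, M, z, beta):
    """One Infini-attention segment (single head, no batch; illustrative sketch).

    Q, K, V : (seg_len, d) queries/keys/values for the current segment
    M       : (d, d) compressive memory carried over from past segments
    z       : (d,)  normalization term accumulated alongside the memory
    beta    : scalar balance factor mixing memory vs. local attention
    """
    seg_len, d = Q.shape

    # 1) Retrieve from compressive memory via linear attention
    sQ = elu_plus_one(Q)
    A_mem = (sQ @ M) / (sQ @ z + 1e-6)[:, None]

    # 2) Standard causal softmax attention within the segment
    scores = (Q @ K.T) / np.sqrt(d)
    causal = np.tril(np.ones((seg_len, seg_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    A_local = weights @ V

    # 3) Mix the two streams with the sigmoid-gated balance factor
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_local

    # 4) Compress this segment's keys/values into the memory for later segments
    sK = elu_plus_one(K)
    M_new = M + sK.T @ V
    z_new = z + sK.sum(axis=0)
    return A, M_new, z_new
```

Because the memory `M` is a fixed `d × d` matrix regardless of how many segments are folded into it, each update in step 4 is lossy; repeating it over very long sequences is the repeated-compression effect the abstract links to degraded retrieval accuracy.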