Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
By: Haoran Deng, Yingyu Lin, Zhenghao Lin, and more
Potential Business Impact:
Teaches computers to learn from very long texts.
Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
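The abstract only states that LongFilter contrasts model predictions under long-context versus short-context settings; the exact scoring rule, window sizes, target-span length, and use of LLaMA-3-8B as the scoring model below are assumptions for illustration, not the paper's specification. A minimal sketch of that contrast, scoring a document by how much the distant context lowers the negative log-likelihood of its final span:

```python
# Sketch of a long- vs short-context information-gain score (assumed details,
# not the paper's exact method): higher gain means distant context matters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumed scoring model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


def span_nll(context_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Mean negative log-likelihood of target_ids given context_ids."""
    input_ids = torch.cat([context_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    # Logits at position i predict token i+1, so target predictions start
    # at the last context position.
    start = context_ids.shape[0] - 1
    preds = logits[start : start + target_ids.shape[0]]
    log_probs = torch.log_softmax(preds.float(), dim=-1)
    token_ll = log_probs[torch.arange(target_ids.shape[0]), target_ids]
    return -token_ll.mean().item()


def long_context_gain(text: str, short_window: int = 2048, span: int = 512) -> float:
    """NLL(short context) - NLL(long context) for the document's final span."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    target = ids[-span:]                   # span to be predicted (assumed size)
    long_ctx = ids[:-span]                 # full preceding document
    short_ctx = long_ctx[-short_window:]   # local context only (assumed size)
    return span_nll(short_ctx, target) - span_nll(long_ctx, target)
```

Under this reading, documents with a large positive gain are the ones worth keeping for long-context pretraining, since their spans cannot be predicted from local context alone.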
Similar Papers
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
Computation and Language
Lets computers understand much longer stories.
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
Computation and Language
Tests if computers can understand very long texts.
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval
Computation and Language
Makes computers understand long stories better.