Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs
By: Rachit Bansal, Aston Zhang, Rishabh Tiwari, and more
Potential Business Impact:
Helps computers remember and use more information.
Progress on training and architecture strategies has enabled LLMs with context lengths of millions of tokens. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. Separately, inference-time compute has been shown to scale LLM performance on challenging multi-step reasoning tasks, often by generating thinking tokens. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes the limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks: for Qwen3-4B, our method yields average gains of 12.6 and 14.1 percentage points across subsets of the LongBench-v2 and ZeroScrolls benchmarks, respectively. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies such as producing more thinking tokens.
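To make the core idea concrete, here is a minimal sketch of test-time training on a given context: before answering a query, run a few gradient steps of next-token prediction over the context itself, then generate as usual. This is an illustrative assumption of the setup, not the paper's exact recipe; the loss, which parameters are updated, the chunking scheme, and all hyperparameters (learning rate, steps, chunk size) below are placeholders.

```python
# Sketch: adapt a causal LM to a long context via a few gradient steps,
# then answer questions about that context. Assumed details are marked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # one of the models evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# Assumed optimizer and learning rate; the paper may update only a subset of parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def test_time_train(context: str, num_steps: int = 4, chunk_tokens: int = 2048):
    """Run a few epochs of next-token prediction on chunks of the context."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    chunks = [ids[:, i:i + chunk_tokens] for i in range(0, ids.shape[1], chunk_tokens)]
    for _ in range(num_steps):
        for chunk in chunks:
            outputs = model(input_ids=chunk, labels=chunk)  # LM loss on the context
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

def answer(question: str, context: str, max_new_tokens: int = 256) -> str:
    """Query the adapted model; prompt format here is an assumption."""
    model.eval()
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Usage: adapt to the long document first, then ask about it.
# test_time_train(long_document)
# print(answer("What does the final section conclude?", long_document))
```

The point of the sketch is the shift in where inference compute goes: a handful of context-specific gradient updates instead of longer chains of thinking tokens over a static attention pattern.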
Similar Papers
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
Computation and Language
Lets computers understand much longer stories.
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Computation and Language
Lets computers remember much longer stories.
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval
Computation and Language
Makes computers understand long stories better.