Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
By: Giovanni Monea, Yair Feldman, Shankar Padmanabhan, and more
Potential Business Impact:
Makes AI remember more without using too much memory.
The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value (KV) cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict the compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method adds little overhead over the conventional RL process, as it reuses the RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
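To make the eviction step concrete, here is a minimal PyTorch sketch of what evicting a compressed segment could look like, assuming a per-layer cache of (key, value) tensors with shape (batch, heads, seq_len, head_dim) and assuming the beacon token's entry has just been appended at the end of the cache. The function name, the segment length, and the cache layout are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the cache-eviction step after a learned beacon token
# has summarized the most recent segment of generated reasoning tokens.
# Assumed cache layout: tuple of per-layer (key, value) tensors of shape
# (batch, heads, seq_len, head_dim), with the beacon's entry last.
import torch

def evict_compressed_segment(cache, segment_len):
    """Drop the `segment_len` raw KV entries that the beacon summarized,
    keeping all earlier entries plus the beacon's own entry at the end."""
    new_cache = []
    for keys, values in cache:
        prefix_k = keys[:, :, : -(segment_len + 1)]
        prefix_v = values[:, :, : -(segment_len + 1)]
        beacon_k = keys[:, :, -1:]
        beacon_v = values[:, :, -1:]
        new_cache.append((torch.cat([prefix_k, beacon_k], dim=2),
                          torch.cat([prefix_v, beacon_v], dim=2)))
    return tuple(new_cache)

# Toy usage: 2 layers, batch 1, 4 heads, head_dim 16; the cache holds
# 10 prompt tokens + 64 reasoning tokens + 1 beacon token = 75 entries.
cache = tuple(
    (torch.randn(1, 4, 75, 16), torch.randn(1, 4, 75, 16))
    for _ in range(2)
)
compressed = evict_compressed_segment(cache, segment_len=64)
print(compressed[0][0].shape)  # torch.Size([1, 4, 11, 16]): prompt + beacon
```

In a full generation loop, this eviction would run every fixed number of generated tokens, immediately after a forward pass on the beacon token writes its key/value entry into the cache, so memory stays roughly constant as the reasoning trace grows.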
Similar Papers
Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
Computation and Language
Helps AI remember more for complex thinking.
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
Computation and Language
Makes AI think better by saving memory.
G-KV: Decoding-Time KV Cache Eviction with Global Attention
Computation and Language
Makes AI remember more without slowing down.