Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons
By: Giovanni Monea, Yair Feldman, Shankar Padmanabhan, and more
Potential Business Impact:
Makes AI remember more without using too much memory.
The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value (KV) cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict the compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method adds little overhead over the conventional RL process, as it reuses the RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
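To make the eviction step concrete, here is a minimal PyTorch sketch of what evicting a compressed segment could look like, assuming a per-layer cache of (key, value) tensors with shape (batch, heads, seq_len, head_dim) and assuming the beacon token's entry has just been appended at the end of the cache. The function name, the segment length, and the cache layout are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the cache-eviction step after a learned beacon token
# has summarized the most recent segment of generated reasoning tokens.
# Assumed cache layout: tuple of per-layer (key, value) tensors of shape
# (batch, heads, seq_len, head_dim), with the beacon's entry last.
import torch

def evict_compressed_segment(cache, segment_len):
    """Drop the `segment_len` raw KV entries that the beacon summarized,
    keeping all earlier entries plus the beacon's own entry at the end."""
    new_cache = []
    for keys, values in cache:
        prefix_k = keys[:, :, : -(segment_len + 1)]
        prefix_v = values[:, :, : -(segment_len + 1)]
        beacon_k = keys[:, :, -1:]
        beacon_v = values[:, :, -1:]
        new_cache.append((torch.cat([prefix_k, beacon_k], dim=2),
                          torch.cat([prefix_v, beacon_v], dim=2)))
    return tuple(new_cache)

# Toy usage: 2 layers, batch 1, 4 heads, head_dim 16; the cache holds
# 10 prompt tokens + 64 reasoning tokens + 1 beacon token = 75 entries.
cache = tuple(
    (torch.randn(1, 4, 75, 16), torch.randn(1, 4, 75, 16))
    for _ in range(2)
)
compressed = evict_compressed_segment(cache, segment_len=64)
print(compressed[0][0].shape)  # torch.Size([1, 4, 11, 16]): prompt + beacon
```

In a full generation loop, this eviction would run every fixed number of generated tokens, immediately after a forward pass on the beacon token writes its key/value entry into the cache, so memory stays roughly constant as the reasoning trace grows.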
Similar Papers
Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
Computation and Language
Helps AI remember more for complex thinking.
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
Computation and Language
Makes AI think better by saving memory.
G-KV: Decoding-Time KV Cache Eviction with Global Attention
Computation and Language
Makes AI remember more without slowing down.