Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

Published: September 15, 2025 | arXiv ID: 2509.12464v1

By: Ryan Lucas, Kayhan Behdin, Zhipeng Wang, and more

Potential Business Impact:

Makes reasoning AI models run faster and cheaper without losing accuracy.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces at inference time, which makes them costly to deploy at scale. We show that compression techniques such as neural network pruning cause greater performance loss than in typical language modeling tasks, and in some cases can make the model slower, since the pruned model produces more thinking tokens but with worse performance. We show that this is partly because standard LLM pruning methods focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning, we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
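The core idea — reconstructing layer outputs on calibration data that includes the model's own chain-of-thought tokens, not just the prompt — can be illustrated with a toy sketch. This is not the paper's implementation: the weight-selection criterion here is plain magnitude (SparseGPT uses a Hessian-based one), the function name and shapes are invented for illustration, and the "activations" are random stand-ins for real prompt and on-policy decode activations.

```python
import numpy as np

def prune_with_reconstruction(W, X_prompt, X_cot, sparsity=0.5):
    """Toy reasoning-aware pruning of one linear layer W (d_in, d_out)."""
    # Combine prompt activations with on-policy chain-of-thought
    # activations, so the reconstruction objective reflects
    # decode-time behavior rather than input-only statistics.
    X = np.vstack([X_prompt, X_cot])        # (n_samples, d_in)
    Y = X @ W                               # dense outputs to match

    # Choose weights to drop by magnitude (toy criterion).
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    mask = np.abs(W) >= thresh

    # Refit each output column on its surviving weights via least
    # squares over the combined calibration activations.
    W_pruned = np.zeros_like(W)
    for j in range(W.shape[1]):
        idx = np.where(mask[:, j])[0]
        if idx.size:
            coef, *_ = np.linalg.lstsq(X[:, idx], Y[:, j], rcond=None)
            W_pruned[idx, j] = coef
    return W_pruned, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
X_prompt = rng.normal(size=(32, 16))   # stand-in for input activations
X_cot = rng.normal(size=(48, 16))      # stand-in for on-policy CoT activations
W_pruned, mask = prune_with_reconstruction(W, X_prompt, X_cot, sparsity=0.5)
print(f"fraction of weights kept: {mask.mean():.2f}")
```

The point of the sketch is the calibration set: stacking `X_cot` alongside `X_prompt` makes the least-squares refit minimize error on the decode-dominated distribution the pruned model will actually see.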

Repos / Data Links
https://github.com/RyanLucas3/RAC

Page Count
10 pages

Category
Computer Science:
Artificial Intelligence