Accelerating Large Language Model Inference via Early-Exiting Algorithms

Published: September 7, 2025 | arXiv ID: 2509.05915v1

By: Sangmin Bae

Potential Business Impact:

Reduces the cost and latency of running large language models, making AI services faster and cheaper to operate.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods like early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. To this end, our work first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the critical synchronization issues affecting dynamic inference. Finally, this work presents a unified framework where lightweight routers are pretrained to dynamically assign an optimal recursion depth for each token. This approach establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.
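To make the final contribution concrete, below is a minimal, illustrative sketch of the core idea: a single shared transformer block applied recursively, with a lightweight router that assigns each token its own recursion depth, so shallow-routed tokens effectively exit early. All names (RecursiveEarlyExitLM, the router head, etc.) and the routing rule are assumptions for illustration, not the dissertation's actual implementation; in particular, this toy version still runs the shared block for inactive tokens, whereas the real efficiency gains come from skipping that work in batched inference.

```python
import torch
import torch.nn as nn

class RecursiveEarlyExitLM(nn.Module):
    """Toy recursive language model (illustrative only, not the paper's code):
    one parameter-shared block applied up to max_depth times, with a
    lightweight router that picks a per-token recursion depth up front."""

    def __init__(self, vocab_size=100, d_model=64, max_depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Deep parameter sharing: a single block reused at every depth.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=128, batch_first=True
        )
        # Lightweight router: scores each token for each possible depth.
        self.router = nn.Linear(d_model, max_depth)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.max_depth = max_depth

    def forward(self, tokens):
        h = self.embed(tokens)                      # (batch, seq, d_model)
        # Assign each token a depth in {1, ..., max_depth} (inference-style
        # hard routing; training the router is a separate matter).
        depth = self.router(h).argmax(dim=-1) + 1   # (batch, seq)
        for step in range(1, self.max_depth + 1):
            # Tokens routed to at least this depth get another block
            # application; the rest keep their state (their "early exit").
            active = (depth >= step).unsqueeze(-1)  # (batch, seq, 1)
            h = torch.where(active, self.shared_block(h), h)
        return self.lm_head(h)

# Usage: every token still yields logits, but shallow-routed tokens
# conceptually cost fewer block applications.
model = RecursiveEarlyExitLM()
logits = model(torch.randint(0, 100, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 100])
```

Deciding each token's depth once, up front, is what lets this style of routing avoid the per-step synchronization that makes conventional early-exiting awkward in batched serving, which mirrors the abstract's motivation for pretraining the routers.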

Page Count
136 pages

Category
Computer Science: Computation and Language