Accelerating Large Language Model Inference via Early-Exiting Algorithms
By: Sangmin Bae
Potential Business Impact:
Makes smart computer programs run faster and cheaper.
Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods such as early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves the conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. It first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. It then shows that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the synchronization issues that hamper dynamic batched inference. Finally, it presents a unified framework in which lightweight routers are pretrained to dynamically assign an optimal recursion depth to each token. This approach establishes a new Pareto frontier between efficiency and performance by optimizing for both adaptive computation and parameter efficiency within a single model.
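To make the interplay of these ideas concrete, here is a minimal PyTorch sketch of a transformer that reuses one shared block across recursion steps while a small router picks a per-token recursion depth. It is an illustration only, not the dissertation's actual implementation: the class name SharedRecursiveLM, the argmax-based router, the depth range, and all dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn


class SharedRecursiveLM(nn.Module):
    """Toy transformer LM that reuses one shared block at every recursion
    step (deep parameter sharing) while a lightweight router assigns each
    token its own recursion depth (adaptive, early-exit-style computation)."""

    def __init__(self, vocab_size=1000, d_model=256, n_heads=4, max_depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One block whose weights are reused at every depth (parameter sharing).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Hypothetical router: predicts how many recursions each token needs.
        self.router = nn.Linear(d_model, max_depth)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.max_depth = max_depth

    def forward(self, input_ids):
        h = self.embed(input_ids)                      # (batch, seq, d_model)
        # Assign each token a depth in {1, ..., max_depth} from its embedding.
        depth = self.router(h).argmax(dim=-1) + 1      # (batch, seq)
        out = h
        for step in range(1, self.max_depth + 1):
            active = depth >= step                     # tokens still computing
            if not active.any():
                break
            # For clarity this sketch runs the block on all positions and
            # masks the update; a real system would gather only active tokens.
            updated = self.shared_block(out)
            out = torch.where(active.unsqueeze(-1), updated, out)
        return self.lm_head(out), depth


if __name__ == "__main__":
    model = SharedRecursiveLM()
    tokens = torch.randint(0, 1000, (2, 8))            # dummy batch of token ids
    logits, depths = model(tokens)
    print(logits.shape)                                # torch.Size([2, 8, 1000])
    print(depths)                                      # per-token recursion depths
```

Because every token passes through the same shared block, a token that exits at a shallow depth simply stops being updated rather than requiring separate per-depth weights, which is the property that keeps batched decoding simple in this sketch.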
Similar Papers
Token Level Routing Inference System for Edge Devices
Computation and Language
Makes small AI smart enough for big jobs.
Dynamic Early Exit in Reasoning Models
Computation and Language
Computers solve problems faster and better.
A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE)
Computation and Language
Makes smart computer programs faster and cheaper.