LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
By: Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, and more
Potential Business Impact:
Makes AI smarter and faster using less power.
Mixed-precision computations are a hallmark of the current stage of AI, driving progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally rich functions, concentrating on transformer inference. Based on a rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately, while all other computations can be carried out at lower precision. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that even very low recomputation rates yield accuracy improvements of up to two orders of magnitude.
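To make the adaptive strategy concrete, here is a minimal NumPy sketch of the idea applied to a softmax composition: the inner function $g$ (a matrix-vector product) is evaluated in float16, a small set of components is selected by looking ahead at the outer function $f$ (softmax), and only those components are recomputed in float64. The function name `lamp_style_softmax` and the top-$k$-by-magnitude selection rule are illustrative assumptions; the paper derives its actual selection criterion from the rounding error analysis of the composition.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def lamp_style_softmax(W, x, k=4):
    # Low-precision pass: compute all logits z = W @ x in float16.
    z = (W.astype(np.float16) @ x.astype(np.float16)).astype(np.float32)

    # Look-ahead selection: softmax is dominated by the largest logits,
    # so their rounding errors propagate most strongly into f(g(x)).
    # Picking the top-k logits is a simple proxy for the paper's
    # error-analysis-based criterion (an assumption of this sketch).
    idx = np.argsort(z)[-k:]

    # High-precision recomputation of the selected components only.
    z[idx] = (W[idx].astype(np.float64) @ x.astype(np.float64)).astype(np.float32)

    return softmax(z)

# Usage: refine 8 of 256 logits (~3% recomputation rate) in float64.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64))
x = rng.standard_normal(64)
probs = lamp_style_softmax(W, x, k=8)
```

The extra cost of the high-precision pass scales with the recomputation rate $k/n$, which is why, as the abstract notes, very low recomputation rates can already deliver large accuracy gains.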
Similar Papers
Efficient Mixed-Precision Large Language Model Inference with TurboMind
Distributed, Parallel, and Cluster Computing
Makes AI models run faster and use less power.
Architectural Trade-offs in Small Language Models Under Compute Constraints
Computation and Language
Makes small AI models smarter with less computer power.
Experts are all you need: A Composable Framework for Large Language Model Inference
Machine Learning (CS)
Makes AI smarter and faster through teamwork.