DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
By: Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee and more
Potential Business Impact:
Makes AI models faster and smarter on phones.
How can we effectively handle queries for on-device large language models (LLMs) under varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs: multiple model variants quantized to different bitwidths are overlaid in memory. Yet an important question remains open: how should a model be configured to match a target precision or latency? Mixed-precision quantization offers a promising solution, and we take it further by leveraging a key observation: the sensitivity of each layer changes dynamically across decoding iterations. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on its input values. DP-LLM augments each linear layer in an LLM with a precision selector that determines the bitwidth at runtime, using a lightweight error estimator and threshold values learned through fine-tuning. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
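To make the mechanism concrete, here is a minimal sketch (not the authors' code) of the runtime precision-selection idea described in the abstract: each linear layer keeps weights quantized at several bitwidths, estimates from the current input how much error a low-bit variant would incur, and falls back to a higher bitwidth when that estimate exceeds a learned threshold. All names and the specific error statistic (PrecisionSelectiveLinear, error_scale, thresholds, the mean-activation heuristic) are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn


class PrecisionSelectiveLinear(nn.Module):
    """Linear layer with a per-input runtime bitwidth selector (illustrative sketch)."""

    def __init__(self, in_features, out_features, bitwidths=(2, 4, 8)):
        super().__init__()
        self.bitwidths = sorted(bitwidths)  # try the cheapest (lowest-bit) variant first
        # One weight tensor per bitwidth; in practice these would be overlaid
        # multi-scale quantized weights sharing storage, not independent copies.
        self.weights = nn.ParameterDict({
            str(b): nn.Parameter(torch.randn(out_features, in_features) * 0.02)
            for b in self.bitwidths
        })
        # Per-bitwidth scalar used by the lightweight error estimator
        # (roughly proportional to the quantization noise of that bitwidth).
        self.error_scale = nn.ParameterDict({
            str(b): nn.Parameter(torch.tensor(1.0 / b)) for b in self.bitwidths
        })
        # Thresholds learned during fine-tuning; one per non-maximal bitwidth.
        self.thresholds = nn.ParameterDict({
            str(b): nn.Parameter(torch.tensor(0.1)) for b in self.bitwidths[:-1]
        })

    def select_bitwidth(self, x):
        # Cheap input statistic: mean activation magnitude of the current token(s).
        act_norm = x.abs().mean()
        for b in self.bitwidths[:-1]:
            est_error = self.error_scale[str(b)] * act_norm
            if est_error <= self.thresholds[str(b)]:
                return b  # low-bit variant is predicted to be accurate enough
        return self.bitwidths[-1]  # fall back to the highest precision

    def forward(self, x):
        b = self.select_bitwidth(x)
        return nn.functional.linear(x, self.weights[str(b)])


if __name__ == "__main__":
    layer = PrecisionSelectiveLinear(16, 16)
    token = torch.randn(1, 16)
    print("chosen bitwidth:", layer.select_bitwidth(token))
    print("output shape:", layer(token).shape)
```

The design point this sketch illustrates is that the selection is made per layer and per decoding step from the live activations, rather than fixing one mixed-precision configuration offline.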
Similar Papers
Mixed-Precision Quantization for Language Models: Techniques and Prospects
Machine Learning (CS)
Makes smart computer programs smaller and faster.
DLLMQuant: Quantizing Diffusion-based Large Language Models
Computation and Language
Makes AI write faster and smaller.
Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Computation and Language
Makes big AI models run on small phones.