Turning LLM Activations Quantization-Friendly
By: Patrik Czakó, Gábor Kertész, Sándor Szénási
Potential Business Impact:
Makes AI models cheaper and faster to run.
Quantization effectively reduces the serving costs of Large Language Models (LLMs) by speeding up data movement through compressed parameters and enabling faster operations via integer arithmetic. However, exploiting integer arithmetic requires quantizing both weights and activations, which is challenging because of the significant outliers in LLM activations that inflate quantization error. In this work, we investigate these outliers with an emphasis on their effect on layer-wise quantization error, then examine how smoothing and rotation transform the observed values. Our primary contributions are a new metric for measuring and visualizing quantization difficulty based on channel magnitudes, and a hybrid approach that applies channel-wise scaling before rotation, supported by a mathematical formulation of its benefits.
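The abstract mentions three ingredients that a short sketch can make concrete: per-tensor INT8 quantization of activations with outlier channels, channel-wise scaling (smoothing) that migrates activation magnitude into the weights, and an orthogonal rotation applied before quantization. The snippet below is a minimal NumPy illustration of that pipeline, not the paper's implementation; the toy dimensions, the smoothing exponent alpha = 0.5, and the random orthogonal matrix standing in for a Hadamard rotation are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activation matrix (tokens x channels) with a few outlier channels,
# mimicking the large-magnitude channels observed in LLM activations.
X = rng.normal(size=(256, 64))
X[:, [3, 17, 42]] *= 30.0                       # inject outlier channels

# Toy weight matrix (in_channels x out_features).
W = rng.normal(size=(64, 64)) * 0.05

def quantize_int8(A):
    """Symmetric per-tensor INT8 fake-quantization."""
    scale = np.abs(A).max() / 127.0
    return np.round(A / scale).clip(-127, 127) * scale

def layer_error(Xt, Wt):
    """Relative layer-output error when both operands are quantized.
    (Illustrative metric only, not the channel-magnitude metric from the paper.)"""
    ref = X @ W                                 # full-precision reference
    out = quantize_int8(Xt) @ quantize_int8(Wt)
    return np.linalg.norm(out - ref) / np.linalg.norm(ref)

# 1) Plain W8A8 quantization: outlier channels dominate the activation scale.
err_plain = layer_error(X, W)

# 2) Smoothing: per-channel scaling moved into the weights
#    (SmoothQuant-style; alpha = 0.5 is an assumed setting).
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / (np.abs(W).max(axis=1) ** (1 - alpha))
X_s, W_s = X / s, W * s[:, None]                # X @ W == X_s @ W_s exactly

# 3) Rotation: multiply by an orthogonal matrix Q to spread outlier energy
#    across channels; Q @ Q.T == I leaves the full-precision product unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal stand-in for a Hadamard transform
X_r, W_r = X_s @ Q, Q.T @ W_s                   # hybrid: smooth first, then rotate

print(f"plain W8A8 error:        {err_plain:.4f}")
print(f"smoothed error:          {layer_error(X_s, W_s):.4f}")
print(f"smoothed+rotated error:  {layer_error(X_r, W_r):.4f}")
```

On typical runs the outlier channels dominate the per-tensor scale, so the plain W8A8 error is largest, while smoothing and then rotating distributes magnitude more evenly across channels; this is the intuition behind the scaling-before-rotation hybrid described in the abstract.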
Similar Papers
Gradual Binary Search and Dimension Expansion: A general method for activation quantization in LLMs
Machine Learning (CS)
Makes smart computer brains run faster on phones.
Achieving binary weight and activation for LLMs using Post-Training Quantization
Machine Learning (CS)
Makes big AI models much smaller and faster.
KLLM: Fast LLM Inference with K-Means Quantization
Machine Learning (CS)
Makes AI smarter and faster using less computer power.