InfoQ: Mixed-Precision Quantization via Global Information Flow
By: Mehmet Emre Akbulut, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, and more
Potential Business Impact:
Makes AI models run efficiently on small devices.
Mixed-precision quantization (MPQ) is crucial for deploying deep neural networks on resource-constrained devices, but finding the optimal bit-width for each layer is a complex combinatorial optimization problem. Current state-of-the-art methods rely on computationally expensive search algorithms or on local sensitivity heuristics, such as Hessian-based proxies, which fail to capture the cascading global effects of quantization error. In this work, we argue that the quantization sensitivity of a layer should not be measured by its local properties, but by its impact on the information flow throughout the entire network. We introduce InfoQ, a novel framework for MPQ that is training-free in the bit-width search phase. InfoQ assesses layer sensitivity by quantizing each layer at different bit-widths and measuring, through a single forward pass, the resulting change in mutual information in the subsequent layers. This quantifies how much each layer's quantization impacts the network's information flow. The resulting scores are used to formulate bit-width allocation as an integer linear programming problem, which is solved efficiently to minimize total sensitivity under a given budget (e.g., model size or BitOps). Our retraining-free search phase provides a superior search-time/accuracy trade-off (using two orders of magnitude less data than state-of-the-art methods such as LIMPQ), while yielding up to a 1% accuracy improvement for MobileNetV2 and ResNet18 on ImageNet at high compression rates (14X and 10.66X).
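To make the two steps of the abstract concrete, here is a minimal sketch of the sensitivity-scoring idea. This is not the authors' code: `quantize_layer(model, name, bits)` is an assumed, user-supplied helper that fake-quantizes one layer in place, and the "change in mutual information" is approximated with a crude histogram estimator over a single calibration batch, since the abstract does not specify the exact estimator.

```python
# Sketch only, assuming PyTorch models and an external quantize_layer helper.
import copy
import numpy as np
import torch
from sklearn.metrics import mutual_info_score

def binned(t, bins=32):
    # Discretize activations so mutual_info_score can treat them as labels.
    a = t.detach().flatten().cpu().numpy()
    edges = np.histogram_bin_edges(a, bins=bins)
    return np.digitize(a, edges[1:-1])

def activations(model, x, layer_names):
    # Collect outputs of the named modules in a single forward pass.
    acts, hooks = {}, []
    for name, mod in model.named_modules():
        if name in layer_names:
            hooks.append(mod.register_forward_hook(
                lambda m, i, o, n=name: acts.__setitem__(n, o)))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return acts

def infoq_sensitivity(model, layer, bits, x, downstream, quantize_layer):
    # Score = information the downstream activations lose about their
    # full-precision counterparts once `layer` is quantized to `bits` bits.
    ref = activations(model, x, downstream)
    q_model = copy.deepcopy(model)
    quantize_layer(q_model, layer, bits)          # assumed helper, not shown
    q = activations(q_model, x, downstream)
    return sum(mutual_info_score(binned(ref[n]), binned(ref[n]))   # = H(ref)
               - mutual_info_score(binned(ref[n]), binned(q[n]))
               for n in downstream)
```

Once a sensitivity table has been built this way, the bit-width allocation described in the abstract can be written as an integer linear program. The sketch below uses PuLP with made-up sensitivity scores and layer sizes (names `S`, `P`, `budget_bits` are illustrative, not from the paper); a BitOps budget would simply swap in a different cost coefficient.

```python
import pulp

S = {0: {2: 0.90, 4: 0.30, 8: 0.05},    # hypothetical sensitivity scores
     1: {2: 0.60, 4: 0.20, 8: 0.02},
     2: {2: 1.20, 4: 0.45, 8: 0.10}}
P = {0: 1.0e6, 1: 2.5e6, 2: 0.5e6}      # parameters per layer (illustrative)
budget_bits = 4 * sum(P.values())       # e.g., a 4-bit-average size budget

prob = pulp.LpProblem("bit_allocation", pulp.LpMinimize)
x = {(l, b): pulp.LpVariable(f"x_{l}_{b}", cat="Binary")
     for l in S for b in S[l]}

# Objective: minimize total sensitivity of the chosen bit-widths.
prob += pulp.lpSum(S[l][b] * x[(l, b)] for l in S for b in S[l])

# Exactly one bit-width per layer.
for l in S:
    prob += pulp.lpSum(x[(l, b)] for b in S[l]) == 1

# Model-size budget in bits.
prob += pulp.lpSum(b * P[l] * x[(l, b)] for l in S for b in S[l]) <= budget_bits

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {l: b for (l, b) in x if x[(l, b)].value() == 1}
print(assignment)   # e.g., {0: 4, 1: 4, 2: 8}, depending on the scores
```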
Similar Papers
Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization
Machine Learning (CS)
Makes AI smarter by using less computer power.
Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning
CV and Pattern Recognition
Makes AI smarter with less computer power.
IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs
Machine Learning (CS)
Makes big AI models run on small devices.