Arithmetic-Intensity-Aware Quantization
By: Taig Singh, Shreshth Rajan, Nikhil Iyer
As modern neural networks become increasingly memory-bound, inference throughput is limited by DRAM bandwidth rather than compute. We present Arithmetic-Intensity-Aware Quantization (AIQ), a mixed-precision quantization framework that chooses per-layer bit-widths to maximize arithmetic intensity (AI) while minimizing accuracy loss. AIQ is a post-training quantization method that searches over per-layer quantization schemes to minimize a weighted loss over AI and accuracy. On ResNet-20/CIFAR-10, AIQ increases AI by ~50% over an FP32 baseline while keeping test accuracy within ~1 percentage point, outperforming global uniform quantization schemes. On a memory-bound MobileNetV2 architecture, AIQ configurations deliver 1.66x higher throughput than the FP32 baseline while keeping test accuracy within 1 percentage point. We also find that AIQ naturally quantizes larger layers more aggressively.
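To make the search concrete, below is a minimal sketch of the kind of objective and per-layer bit-width search the abstract describes. The candidate bit-width set, the greedy search strategy, the `lam` trade-off weight, the `accuracy_fn` callback, and the decision to count only weight traffic (ignoring activations) are all illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch of an AI-aware per-layer bit-width search.
# Assumptions (not from the paper): candidate bit-widths, greedy search,
# weight-only memory traffic, and a user-supplied accuracy_fn that returns
# post-training-quantized test accuracy for a given bit-width configuration.

BITS = [4, 8, 16, 32]  # assumed candidate bit-widths per layer


def arithmetic_intensity(layers, bitwidths):
    """Model-level arithmetic intensity: total FLOPs / total bytes moved.

    `layers` is a list of dicts with 'flops' and 'params' (element counts);
    weight traffic scales with the chosen bit-width, activation traffic is
    ignored for simplicity in this sketch.
    """
    flops = sum(l["flops"] for l in layers)
    bytes_moved = sum(l["params"] * b / 8 for l, b in zip(layers, bitwidths))
    return flops / bytes_moved


def aiq_objective(layers, bitwidths, accuracy_fn, lam=1.0):
    """Weighted loss over accuracy drop and (negative) arithmetic intensity,
    mirroring the abstract's description; the exact weighting is an assumption."""
    acc = accuracy_fn(bitwidths)  # accuracy of the quantized model, in [0, 1]
    ai = arithmetic_intensity(layers, bitwidths)
    return (1.0 - acc) - lam * ai


def greedy_search(layers, accuracy_fn, lam=1.0):
    """Greedy per-layer search: start at FP32 and lower one layer's bit-width
    at a time whenever it improves the weighted objective. The paper's search
    algorithm may differ; this is an illustrative stand-in."""
    config = [32] * len(layers)
    best = aiq_objective(layers, config, accuracy_fn, lam)
    improved = True
    while improved:
        improved = False
        for i in range(len(layers)):
            for b in BITS:
                if b >= config[i]:
                    continue
                trial = config.copy()
                trial[i] = b
                score = aiq_objective(layers, trial, accuracy_fn, lam)
                if score < best:
                    best, config, improved = score, trial, True
    return config
```

Because large layers contribute most of the bytes moved, an objective of this form tends to push their bit-widths down first, which is consistent with the abstract's observation that AIQ quantizes larger layers more aggressively.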