APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers
By: Zhuguanyu Wu, Jiayi Zhang, Jiaxin Chen, and more
Potential Business Impact:
Makes AI see better with less computing power.
Vision Transformers (ViTs) have become one of the most widely used backbones for vision tasks. Despite their remarkable performance, they often suffer significant accuracy drops when quantized for practical deployment, particularly under post-training quantization (PTQ) at ultra-low bit-widths. Recently, reconstruction-based PTQ methods have shown promising performance in quantizing Convolutional Neural Networks (CNNs). However, they fail when applied to ViTs, primarily due to inaccurate estimation of output importance and severe accuracy degradation when quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with the Average Perturbation Hessian (APH). Specifically, we first thoroughly analyze existing Hessian-loss approximation approaches and propose an improved average perturbation Hessian loss. To handle the quantization of post-GELU activations, we design an MLP Reconstruction (MR) method that replaces the GELU function in the MLP with ReLU and reconstructs the block with the APH loss on a small unlabeled calibration set. Extensive experiments demonstrate that APHQ-ViT, using linear quantizers, outperforms existing PTQ methods by substantial margins at 3-bit and 4-bit across different vision tasks. The source code is available at https://github.com/GoatWu/APHQ-ViT.
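The abstract names two technical pieces: an Average Perturbation Hessian (APH) loss for weighting output reconstruction, and an MLP Reconstruction (MR) step that swaps GELU for ReLU and re-fits the block on a small unlabeled calibration set. Below is a minimal PyTorch sketch of what such a pipeline could look like; it is not the authors' implementation (see the linked repository for that), and all function names, the probe count `n_probes`, and the perturbation scale `eps` are illustrative. The Hessian diagonal here is a generic Hutchinson-style perturbation estimate; the paper's exact APH formulation may differ in how it approximates and averages the Hessian.

```python
# Minimal sketch, NOT the authors' code: (a) a perturbation-based diagonal
# Hessian estimate used to weight an output-reconstruction loss, and
# (b) GELU -> ReLU replacement in an MLP block followed by re-fitting it
# against the original full-precision block on calibration data.
import copy

import torch
import torch.nn as nn


def _grad(loss_fn, x):
    """Gradient of a scalar loss w.r.t. a detached copy of `x`."""
    x = x.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(loss_fn(x), x)
    return g


def perturbation_hessian_diag(loss_fn, output, eps=1e-3, n_probes=8):
    """Hutchinson-style estimate of diag(H) of a scalar loss w.r.t. `output`.

    Uses E[v * (H v)] = diag(H) for Rademacher probes v, with the
    Hessian-vector product H v approximated by a central difference of
    gradients at output +/- eps * v.
    """
    diag = torch.zeros_like(output)
    for _ in range(n_probes):
        v = torch.randint_like(output, 0, 2) * 2.0 - 1.0  # entries in {-1, +1}
        hv = (_grad(loss_fn, output + eps * v)
              - _grad(loss_fn, output - eps * v)) / (2.0 * eps)
        diag += v * hv
    # Clamp so the importance weights stay non-negative (a design choice).
    return (diag / n_probes).clamp_min(0.0)


def hessian_weighted_mse(q_out, fp_out, h_diag):
    """Importance-weighted reconstruction objective for a quantized block."""
    return (h_diag * (q_out - fp_out) ** 2).mean()


def reconstruct_mlp_with_relu(mlp, calib_batches, h_diag=None, steps=500, lr=1e-4):
    """Replace GELU with ReLU inside `mlp`, then tune the modified block so
    its outputs match the original full-precision block on calibration data."""
    fp_mlp = copy.deepcopy(mlp).eval()  # frozen full-precision teacher
    gelu_names = [n for n, m in mlp.named_modules() if isinstance(m, nn.GELU)]
    for name in gelu_names:
        parent_name, _, attr = name.rpartition(".")
        parent = mlp.get_submodule(parent_name) if parent_name else mlp
        setattr(parent, attr, nn.ReLU())
    opt = torch.optim.AdamW(mlp.parameters(), lr=lr)
    for step in range(steps):
        x = calib_batches[step % len(calib_batches)]
        with torch.no_grad():
            target = fp_mlp(x)
        err = (mlp(x) - target) ** 2
        loss = (h_diag * err).mean() if h_diag is not None else err.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mlp
```

In a block-wise reconstruction loop, `perturbation_hessian_diag` would be computed on the calibration set and averaged over samples (the "Average" in APH), with `hessian_weighted_mse` then driving both quantizer calibration and the MR fine-tuning; the exact scheduling and averaging are design choices of the paper, not shown here.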
Similar Papers
IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers
CV and Pattern Recognition
Makes computer vision faster without losing quality.
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
CV and Pattern Recognition
Makes AI image programs smaller, faster, and more accurate.
VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation
CV and Pattern Recognition
Makes AI models that see and talk smaller.