Achieving binary weight and activation for LLMs using Post-Training Quantization
By: Siqing Song, Chuang Wang, Ruiqi Wang, and more
Potential Business Impact:
Makes big AI models much smaller and faster.
Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when weight and activation precisions fall below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with a W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping, and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose Hessian-aware fine-grained grouping combined with an EM-based quantization scheme. For activation quantization, we equivalently decompose INT4-quantized activations into a 4 * INT1 format and simultaneously smooth the scaling factors based on quantization errors, which further reduces activation quantization error. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.
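To make the A(1*4) idea concrete, the sketch below illustrates the arithmetic identity the abstract relies on: an INT4 value is an exact weighted sum of its four bit-planes, so an INT4 activation tensor can be expanded 4-fold into binary channels and an INT4 matmul evaluated as four binary matmuls. This is a minimal NumPy illustration under simplifying assumptions (a naive per-tensor scale, no error-based smoothing of scaling factors); the function names are hypothetical and this is not the authors' implementation.

```python
# Minimal sketch: equivalent decomposition of INT4 activations into 4 x INT1 bit-planes.
import numpy as np

def quantize_int4(x, scale):
    """Quantize activations to unsigned INT4 integers in [0, 15] (illustrative only)."""
    return np.clip(np.round(x / scale), 0, 15).astype(np.int32)

def decompose_int4_to_int1(q):
    """Split each INT4 value into its 4 bit-planes (values in {0, 1}).

    Returns an array with a 4-fold channel expansion plus the per-plane
    weights (1, 2, 4, 8) that make the decomposition exact: q = sum_k 2**k * bit_k.
    """
    bits = np.stack([(q >> k) & 1 for k in range(4)], axis=-1)  # (..., C, 4)
    plane_weights = 2.0 ** np.arange(4)                          # [1, 2, 4, 8]
    return bits, plane_weights

# Tiny numerical check that the decomposition is exact.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))         # toy activations (batch=2, channels=8)
scale = np.abs(x).max() / 15.0      # naive per-tensor scale, for illustration

q = quantize_int4(x, scale)
bits, w = decompose_int4_to_int1(q)
assert np.array_equal(q, (bits * w).sum(axis=-1).astype(np.int32))

# The same identity lets an INT4 matmul run as four binary (INT1) matmuls.
W = rng.normal(size=(8, 4))
out_int4 = q @ W
out_binary = sum(w[k] * (bits[..., k] @ W) for k in range(4))
assert np.allclose(out_int4, out_binary)
```

Because each bit-plane is binary, each of the four matmuls can in principle be executed with 1-bit activation kernels, which is how the 4-fold channel increase trades precision for binarized compute.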
Similar Papers
PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models
Machine Learning (CS)
Makes AI models smaller and faster.
Binary Quantization For LLMs Through Dynamic Grouping
Machine Learning (CS)
Makes AI models much smaller and faster.
Binary Neural Networks for Large Language Model: A Survey
Computation and Language
Makes AI models smaller and faster to train.