DBellQuant: Breaking the Bell with Double-Bell Transformation for LLMs Post Training Binarization
By: Zijian Ye, Wei Huang, Yifei Yu, and more
Potential Business Impact:
Makes large language models much smaller and cheaper to run.
Large language models (LLMs) demonstrate remarkable performance but face substantial computational and memory challenges that limit their practical deployment. Quantization has emerged as a promising solution; however, its effectiveness is often limited by quantization errors arising from weight distributions that are not quantization-friendly and from the presence of activation outliers. To address these challenges, we introduce DBellQuant, an innovative post-training quantization (PTQ) framework that achieves nearly 1-bit weight compression and 6-bit activation quantization with minimal performance degradation. DBellQuant uses the Learnable Transformation for Dual-Bell (LTDB) algorithm, which transforms single-bell weight distributions into dual-bell forms to reduce binarization errors and applies inverse transformations to smooth activations. DBellQuant sets a new state of the art by preserving superior model performance under aggressive weight and activation quantization. For example, on the WikiText2 dataset, DBellQuant achieves a perplexity of 14.39 on LLaMA2-13B with 6-bit activation quantization, significantly outperforming the 21.35 of BiLLM, which does not quantize activations at all, underscoring its potential for compressing LLMs in real-world applications.
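To make the two mechanisms in the abstract concrete, the toy sketch below illustrates (a) why a dual-bell weight distribution suits sign-based binarization better than the usual single bell centred at zero, and (b) how scaling weight columns while applying the inverse scaling to the matching activation channels leaves the layer output unchanged, which is the equivalence that lets a weight-side transform also smooth activation outliers. This is only an illustrative sketch, not the paper's LTDB algorithm: the bimodal weights, the per-channel scale s, and all shapes are assumptions chosen for the example, whereas LTDB learns its transformation.

import numpy as np

rng = np.random.default_rng(0)

def binarize(w):
    # Sign binarization with a per-row scale alpha = mean |w| (standard 1-bit scheme).
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    return alpha * np.sign(w)

# (a) Why a dual-bell distribution is binarization-friendly: a single bell centred
# at 0 has many weights far from +/-alpha, while a dual bell clustered near +/-mu
# is matched almost exactly by +/-alpha.
shape = (128, 4096)
w_single = rng.normal(0.0, 0.02, size=shape)                                  # single bell at 0
w_dual = rng.choice([-1.0, 1.0], size=shape) * rng.normal(0.02, 0.002, size=shape)  # two modes at +/-0.02
for name, w in [("single-bell", w_single), ("dual-bell", w_dual)]:
    err = np.linalg.norm(w - binarize(w)) / np.linalg.norm(w)
    print(f"{name:11s} relative binarization error: {err:.3f}")

# (b) Output-preserving transform pair: multiplying weight columns by s and
# dividing the matching activation channels by s cancels exactly, so a
# weight-side transform can simultaneously tame activation outliers.
d_in, d_out, batch = 4096, 128, 8
W = rng.normal(0.0, 0.02, size=(d_out, d_in))
X = rng.normal(0.0, 1.0, size=(batch, d_in))
X[:, :8] *= 50.0                      # a few outlier activation channels
s = np.sqrt(np.abs(X).max(axis=0))    # illustrative (not learned) per-channel scale
assert np.allclose(X @ W.T, (X / s) @ (W * s).T)
print("max |activation| before/after smoothing:", np.abs(X).max(), np.abs(X / s).max())

Running the snippet prints a markedly smaller relative binarization error for the dual-bell weights and a much narrower activation range after smoothing, mirroring the intuition behind the dual-bell transform and its activation-smoothing inverse.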
Similar Papers
Binary Neural Networks for Large Language Model: A Survey
Computation and Language
Makes AI models smaller and faster to train.
Achieving binary weight and activation for LLMs using Post-Training Quantization
Machine Learning (CS)
Makes big AI models much smaller and faster.
Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Computation and Language
Makes big AI models run on small phones.