Towards Superior Quantization Accuracy: A Layer-sensitive Approach

Published: March 9, 2025 | arXiv ID: 2503.06518v1

By: Feng Zhang, Yanbin Liu, Weihua Li and more

Potential Business Impact:

Makes smart computer brains smaller and faster.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large Vision and Language Models have exhibited remarkable human-like intelligence in tasks such as natural language comprehension, problem-solving, logical reasoning, and knowledge retrieval. However, training and serving these models require substantial computational resources, posing a significant barrier to their widespread application and further research. To mitigate this challenge, various model compression techniques have been developed to reduce computational requirements. Nevertheless, existing methods often apply a uniform quantization configuration to every layer, ignoring the fact that some layers of a large neural network are harder to quantize than others. This paper tackles this issue by leveraging layer-sensitivity features, such as activation sensitivity and weight distribution kurtosis, to identify layers that are challenging to quantize accurately and to allocate additional memory budget to them. The proposed methods, named SensiBoost (based on activation sensitivity) and KurtBoost (based on weight kurtosis), demonstrate notable improvement in quantization accuracy, achieving up to 9% lower perplexity with only a 2% increase in memory budget on LLaMA models compared to the baseline.
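The idea of spending extra memory only on hard-to-quantize layers can be illustrated with a small sketch. This is not the paper's SensiBoost or KurtBoost algorithm, whose details are not given in the abstract. It is a minimal illustration that assumes a hypothetical rule: score each layer by the kurtosis of its weights and give the highest-scoring fraction of layers a larger bit width. The function names, the bit widths (4 and 8), and the `boost_fraction` parameter are all assumptions made for this example.

```python
import random
import statistics


def kurtosis(values):
    """Pearson kurtosis (fourth standardized moment).

    Roughly 3 for normally distributed data; larger values indicate
    heavier tails, i.e. more outlier weights.
    """
    values = list(values)
    mean = statistics.fmean(values)
    var = statistics.fmean([(v - mean) ** 2 for v in values])
    fourth = statistics.fmean([(v - mean) ** 4 for v in values])
    return fourth / (var ** 2)


def allocate_bits(layer_weights, base_bits=4, boosted_bits=8, boost_fraction=0.1):
    """Assign a bit width to each layer (hypothetical rule, not the paper's).

    layer_weights: dict mapping layer name -> flat list of weights.
    The top `boost_fraction` of layers by weight kurtosis (at least one
    layer) receive `boosted_bits`; all others receive `base_bits`.
    """
    scores = {name: kurtosis(w) for name, w in layer_weights.items()}
    n_boost = max(1, int(boost_fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    boosted = set(ranked[:n_boost])
    return {
        name: (boosted_bits if name in boosted else base_bits)
        for name in layer_weights
    }


# Toy demonstration with synthetic weights: every layer is Gaussian,
# except "layer_3", which also contains a few large outlier weights.
rng = random.Random(0)
demo_layers = {
    f"layer_{i}": [rng.gauss(0.0, 1.0) for _ in range(2000)] for i in range(8)
}
demo_layers["layer_3"] = demo_layers["layer_3"][:-20] + [10.0, -10.0] * 10
demo_plan = allocate_bits(demo_layers, boost_fraction=0.125)
for layer_name, bits in demo_plan.items():
    print(layer_name, bits)
```

In the toy data only the layer with injected outliers has a heavy-tailed weight distribution, so it is the one that receives the larger bit width. A real system would need a calibrated selection rule and a memory-budget constraint, which this sketch does not model.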

Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models

Machine Learning (CS)

Makes smart computer programs smaller and faster.

5 Aug 2025 1

89%

Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models

Computation and Language

Makes big AI models run faster and smaller.

30 Apr 2025 1

89%

Turning LLM Activations Quantization-Friendly

Machine Learning (CS)

Makes AI smarter and cheaper to run.

11 May 2025 1

Page Count

17 pages
