Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models
By: He Xiao, Qingyao Yang, Dirui Xie, and more
Potential Business Impact:
Makes smart computer programs smaller and faster.
Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ, a metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-7B models under extreme low-bit compression. Our method introduces three complementary layer-wise diagnostics (Perplexity Drop, Representational Compactness, and Top-k Energy Gain) that reveal a canonical division of labour across layers, enabling automatic bit-width allocation without gradient updates. Unlike existing approaches that suffer severe accuracy degradation at 2-3 bit precision, LieQ achieves state-of-the-art compression-accuracy trade-offs: on Qwen3-4B, it recovers 95.9% of FP16 baseline performance at 2.05-bit quantization, outperforming GPTQ by 19.7% and AWQ by 18.1% on average across seven zero-shot reasoning tasks. Applied to LLaMA3.2-3B, LieQ maintains 98.2% of baseline accuracy at 2.07-bit precision while enabling 4x memory reduction, establishing new paradigms for deploying small language models on resource-constrained edge devices.
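To make the idea of metric-driven, layer-wise bit allocation concrete, here is a minimal sketch. The abstract names the three diagnostics but does not give their formulas or the allocation rule, so the scoring input and the greedy budget-based allocation below (function `allocate_bits`, parameters `avg_bits` and `candidate_bits`) are illustrative assumptions, not the authors' actual method.

```python
# Illustrative sketch only: the paper's three diagnostics (Perplexity Drop,
# Representational Compactness, Top-k Energy Gain) would produce per-layer
# importance scores; here we take such scores as given and show how an
# average-bit budget could be spread across layers without gradient updates.
import numpy as np


def allocate_bits(layer_scores, avg_bits=2.05, candidate_bits=(2, 3, 4)):
    """Assumed greedy allocation: every layer starts at the lowest precision,
    then the highest-scoring layers are upgraded while the average-bit
    budget allows it."""
    n = len(layer_scores)
    budget = avg_bits * n
    order = np.argsort(layer_scores)[::-1]           # most important layers first
    bits = np.full(n, min(candidate_bits), dtype=float)
    budget -= bits.sum()                             # budget left after the floor
    for idx in order:
        for b in sorted(candidate_bits)[1:]:         # try 3-bit, then 4-bit
            extra = b - bits[idx]
            if extra <= budget:
                budget -= extra
                bits[idx] = b
            else:
                break
    return bits


# Example: 32 transformer layers with hypothetical diagnostic scores.
rng = np.random.default_rng(0)
scores = rng.random(32)
print(allocate_bits(scores))                         # mostly 2-bit, a few upgraded
```

With a 2.05-bit average over 32 layers, the budget leaves room to upgrade only a handful of the highest-scoring layers, which matches the paper's premise that most layers contribute little unique information and can tolerate extreme compression.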
Similar Papers
InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models
Machine Learning (CS)
Fixes AI math mistakes after shrinking it.
LLM Compression: How Far Can We Go in Balancing Size and Performance?
Computation and Language
Makes smart computer programs smaller and faster to run.
Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
Computation and Language
Makes big AI models smaller without losing smarts.