To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
By: Zeyu Yang, Tianyi Zhang, Jianwen Xie, and more
Potential Business Impact:
Makes AI models use less memory and run faster.
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
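To make the exponent-concentration claim concrete, here is a minimal sketch of how one might measure the empirical Shannon entropy of the exponent field of trained weights under an FP8 E4M3-style layout. This is an illustrative assumption, not the paper's ECF8 pipeline: the function name, the hand-rolled E4M3 bias and clipping (bias 7, no saturation or NaN handling), and the synthetic heavy-tailed weights are all hypothetical stand-ins.

```python
import numpy as np

def exponent_entropy_fp8_e4m3(weights: np.ndarray) -> float:
    """Empirical Shannon entropy (bits) of the 4-bit exponent field
    of weights mapped to an FP8 E4M3-style layout.

    Illustrative sketch only; not the paper's ECF8 implementation.
    """
    w = weights.astype(np.float32).ravel()
    w = w[w != 0.0]  # zeros carry no exponent information
    # Unbiased binary exponent of each weight: floor(log2(|w|)).
    exps = np.floor(np.log2(np.abs(w))).astype(np.int64)
    # Clip into the E4M3 biased-exponent range 0..15 (bias 7, assumed here).
    biased = np.clip(exps + 7, 0, 15)
    counts = np.bincount(biased, minlength=16).astype(np.float64)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy demo on synthetic heavy-tailed weights (a stand-in for trained weights):
rng = np.random.default_rng(0)
w = rng.standard_t(df=4, size=1_000_000) * 0.02
h = exponent_entropy_fp8_e4m3(w)
# Lossless bits/weight for E4M3 = 1 sign bit + 3 mantissa bits + H(exponent).
print(f"H(exponent) = {h:.2f} bits -> about FP{1 + 3 + h:.2f} per weight")
```

The arithmetic behind the FP4.67 figure follows the same accounting: if the sign and mantissa bits are incompressible, an E4M3 weight costs 1 + 3 + H(exponent) bits, so an exponent entropy near 0.67 bits yields the stated limit; actual entropies depend on the model.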
Similar Papers
Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats
Machine Learning (CS)
Shrinks AI models to save space and run faster.
Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme
Distributed, Parallel, and Cluster Computing
Makes computers do hard math faster and more accurately.
Scaling Laws for Floating Point Quantization Training
Machine Learning (CS)
Makes AI models run faster and use less power.