Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats
By: Anat Heilper, Doron Singer
Potential Business Impact:
Shrinks AI models to save space and improve speed.
As deep learning models grow and deployment becomes more widespread, reducing the storage and transmission costs of neural network weights has become increasingly important. While prior work such as ZipNN has shown that lossless compression methods, particularly those based on Huffman encoding of floating-point exponents, can significantly reduce model sizes, these techniques have primarily been applied to higher-precision formats such as FP32 and BF16. In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4, which are gaining popularity for efficient inference. We design a compression method that separates and compresses the exponent and mantissa components independently using entropy coding. Our evaluation shows compression ratios of up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache tensors used in large language models (LLMs), finding that they, too, exhibit compressible patterns, enabling memory savings during deployment.
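
For illustration only, the sketch below shows the general field-separation idea on FP8 tensors: each E4M3 byte (1 sign, 4 exponent, 3 mantissa bits) is split into an exponent stream and a sign-plus-mantissa stream, and each stream is compressed independently. This is not the paper's implementation; zlib's DEFLATE is used here as a stand-in for the Huffman/entropy coder described above, and the function names are hypothetical.

import zlib
import numpy as np

def compress_fp8_e4m3(raw: np.ndarray) -> dict:
    # Split each FP8 E4M3 byte into its 4-bit exponent and its
    # sign + 3-bit mantissa, then compress the two streams separately.
    # zlib is only a stand-in for a dedicated entropy coder.
    assert raw.dtype == np.uint8
    exponent = (raw >> 3) & 0x0F                      # bits 6..3
    sign_mantissa = ((raw >> 7) << 3) | (raw & 0x07)  # bit 7 + bits 2..0
    return {
        "exp": zlib.compress(exponent.tobytes(), 9),
        "man": zlib.compress(sign_mantissa.tobytes(), 9),
    }

def decompress_fp8_e4m3(blob: dict) -> np.ndarray:
    # Lossless inverse: reassemble the original FP8 byte stream.
    exponent = np.frombuffer(zlib.decompress(blob["exp"]), dtype=np.uint8)
    sign_mantissa = np.frombuffer(zlib.decompress(blob["man"]), dtype=np.uint8)
    return ((sign_mantissa >> 3) << 7) | (exponent << 3) | (sign_mantissa & 0x07)

# Round-trip check on random bytes. Random data does not shrink; the savings
# reported in the paper come from the skewed exponent distribution of real
# low-precision weights and K/V cache tensors.
weights = np.random.default_rng(0).integers(0, 256, size=1 << 20, dtype=np.uint8)
blob = compress_fp8_e4m3(weights)
assert np.array_equal(decompress_fp8_e4m3(blob), weights)
compressed_size = len(blob["exp"]) + len(blob["man"])
print(f"compressed/original = {compressed_size / weights.nbytes:.2f}")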
Similar Papers
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
Machine Learning (CS)
Makes AI models use less memory and run faster.
Neural Weight Compression for Language Models
Machine Learning (CS)
Makes AI models smaller and faster to use.
An Efficient Compression of Deep Neural Network Checkpoints Based on Prediction and Context Modeling
Machine Learning (CS)
Shrinks computer learning files to save space.