Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats
By: Anat Heilper, Doron Singer
Potential Business Impact:
Shrinks AI models to save space and improve speed.
As deep learning models grow and deployment becomes more widespread, reducing the storage and transmission costs of neural network weights has become increasingly important. While prior work such as ZipNN has shown that lossless compression methods, particularly those based on Huffman encoding of floating-point exponents, can significantly reduce model sizes, these techniques have primarily been applied to higher-precision formats such as FP32 and BF16. In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4, which are gaining popularity for efficient inference. We design a compression method that separates and compresses the exponent and mantissa components independently using entropy coding. Our evaluation shows compression ratios of up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache tensors used in large language models (LLMs), finding that they, too, exhibit compressible patterns, enabling memory savings during deployment.
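
For illustration only, the sketch below shows the general field-separation idea on FP8 tensors: each E4M3 byte (1 sign, 4 exponent, 3 mantissa bits) is split into an exponent stream and a sign-plus-mantissa stream, and each stream is compressed independently. This is not the paper's implementation; zlib's DEFLATE is used here as a stand-in for the Huffman/entropy coder described above, and the function names are hypothetical.

import zlib
import numpy as np

def compress_fp8_e4m3(raw: np.ndarray) -> dict:
    # Split each FP8 E4M3 byte into its 4-bit exponent and its
    # sign + 3-bit mantissa, then compress the two streams separately.
    # zlib is only a stand-in for a dedicated entropy coder.
    assert raw.dtype == np.uint8
    exponent = (raw >> 3) & 0x0F                      # bits 6..3
    sign_mantissa = ((raw >> 7) << 3) | (raw & 0x07)  # bit 7 + bits 2..0
    return {
        "exp": zlib.compress(exponent.tobytes(), 9),
        "man": zlib.compress(sign_mantissa.tobytes(), 9),
    }

def decompress_fp8_e4m3(blob: dict) -> np.ndarray:
    # Lossless inverse: reassemble the original FP8 byte stream.
    exponent = np.frombuffer(zlib.decompress(blob["exp"]), dtype=np.uint8)
    sign_mantissa = np.frombuffer(zlib.decompress(blob["man"]), dtype=np.uint8)
    return ((sign_mantissa >> 3) << 7) | (exponent << 3) | (sign_mantissa & 0x07)

# Round-trip check on random bytes. Random data does not shrink; the savings
# reported in the paper come from the skewed exponent distribution of real
# low-precision weights and K/V cache tensors.
weights = np.random.default_rng(0).integers(0, 256, size=1 << 20, dtype=np.uint8)
blob = compress_fp8_e4m3(weights)
assert np.array_equal(decompress_fp8_e4m3(blob), weights)
compressed_size = len(blob["exp"]) + len(blob["man"])
print(f"compressed/original = {compressed_size / weights.nbytes:.2f}")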
Similar Papers
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
Machine Learning (CS)
Makes AI models use less memory and run faster.
Neural Weight Compression for Language Models
Machine Learning (CS)
Makes AI models smaller and faster to use.
An Efficient Compression of Deep Neural Network Checkpoints Based on Prediction and Context Modeling
Machine Learning (CS)
Shrinks computer learning files to save space.