Towards Efficient LLM Storage Reduction via Tensor Deduplication and Delta Compression
By: Zirui Wang, Tingfeng Lan, Zhaoyuan Su, and more
Potential Business Impact:
Saves storage space by shrinking large language models.
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and compression -- are either LLM-oblivious or incompatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication offers strong synergy with model-aware compressors. Building on these insights, we present BitX, an effective, fast, lossless delta compression algorithm that compresses XORed redundancy between fine-tuned and base LLMs. We build zLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, zLLM reduces model storage consumption by 49.5 percent, over 20 percent more than state-of-the-art deduplication and compression designs.
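To make the XOR-delta idea behind BitX concrete, below is a minimal, hypothetical Python sketch: the raw bit patterns of a fine-tuned tensor are XORed against the corresponding base-model tensor, and the resulting (largely sparse) residual is fed to a general-purpose lossless compressor. The function names, the float32 assumption, and the use of the zstandard library are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of XOR-based delta compression in the spirit of BitX.
# Assumptions (not from the paper): float32 tensors, zstandard as the backend
# lossless compressor, and illustrative function names.
import numpy as np
import zstandard as zstd

def bitx_compress(base: np.ndarray, finetuned: np.ndarray) -> bytes:
    """Losslessly encode a fine-tuned tensor as an XOR delta against its base."""
    assert base.shape == finetuned.shape and base.dtype == finetuned.dtype == np.float32
    # Reinterpret the float payloads as raw integers so bitwise XOR is well defined.
    xor_delta = base.view(np.uint32) ^ finetuned.view(np.uint32)
    # Fine-tuning perturbs most weights only slightly, so many sign, exponent, and
    # high-order mantissa bits cancel, leaving a residual that compresses well.
    return zstd.ZstdCompressor(level=3).compress(xor_delta.tobytes())

def bitx_decompress(base: np.ndarray, blob: bytes) -> np.ndarray:
    """Exactly recover the fine-tuned tensor from the base tensor and the delta."""
    raw = zstd.ZstdDecompressor().decompress(blob)
    xor_delta = np.frombuffer(raw, dtype=np.uint32).reshape(base.shape)
    return (base.view(np.uint32) ^ xor_delta).view(base.dtype)

# Toy example: a fine-tuned tensor that differs only slightly from its base.
base = np.random.rand(1024, 1024).astype(np.float32)
finetuned = base + np.float32(1e-4) * np.random.rand(1024, 1024).astype(np.float32)
blob = bitx_compress(base, finetuned)
restored = bitx_decompress(base, blob)
assert np.array_equal(restored, finetuned)  # lossless round trip
print(f"compressed delta: {len(blob)} bytes vs raw tensor: {finetuned.nbytes} bytes")
```

In this sketch the choice of compressor and float width are placeholders; the paper's point is that XORing bit-similar tensors within an LLM family exposes redundancy that a lossless compressor can then exploit, while deduplication handles tensors that are byte-identical across variants.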
Similar Papers
Lossless Compression for LLM Tensor Incremental Snapshots
Machine Learning (CS)
Makes AI training faster by shrinking data.
Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design
Hardware Architecture
Makes AI smarter using less computer memory.
ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models
Machine Learning (CS)
Makes many AI models smaller and faster.