Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression
By: Feng Cheng, Cong Guo, Chiyue Wei, and more
Potential Business Impact:
Makes AI language models run faster and use less memory.
Large language models (LLMs) have demonstrated transformative capabilities across diverse artificial intelligence applications, yet their deployment is hindered by substantial memory and computational demands, especially in resource-constrained environments. Quantization techniques have emerged as a critical solution, reducing data precision to enhance memory and computational efficiency. However, existing methods often suffer from high runtime overheads and potential accuracy degradation. To address these challenges, we propose Ecco, an entropy-based cache compression technique tailored for LLMs. Ecco combines group-wise and non-uniform quantization with pre-defined shared k-means patterns and Huffman coding to exploit the inherent entropy characteristics of LLM cache data. Recognizing the inefficiencies of traditional Huffman coding in terms of parallelism and latency, we introduce a novel parallel Huffman-based decoding process with a multi-stage pipeline design, reducing latency by two orders of magnitude and achieving throughput comparable to GPU L2 caches. Comprehensive evaluations demonstrate that Ecco achieves speedups of up to 2.9$\times$ and 1.9$\times$ over the state-of-the-art AWQ and SmoothQuant frameworks, respectively, and 2.4$\times$ over the Olive accelerator, all while increasing memory capacity by nearly 4$\times$ and maintaining state-of-the-art LLM accuracy. These results underscore the effectiveness of our entropy-based cache compression in enhancing LLM performance and efficiency, paving the way for more deployable large-scale AI models.
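To make the pipeline concrete, here is a minimal sketch (not the paper's implementation) of the two-step idea the abstract describes: group-wise non-uniform quantization against shared centroids, followed by Huffman coding of the resulting index stream. The group size of 128, the 16 centroids standing in for the paper's pre-defined shared k-means patterns, and all function names are illustrative assumptions.

```python
# Sketch: group-wise non-uniform quantization + Huffman coding of indices.
# All parameters (group_size=128, 16 centroids) are illustrative assumptions,
# not values from the Ecco paper.
import heapq
from collections import Counter

import numpy as np


def quantize_group(values: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each value in a group to the index of its nearest shared centroid."""
    # |values[:, None] - centroids[None, :]| has shape (group_size, num_centroids).
    return np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)


def build_huffman_codes(indices: np.ndarray) -> dict[int, str]:
    """Build a Huffman code table from centroid-index frequencies."""
    freq = Counter(indices.tolist())
    # Heap entries: (frequency, unique tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        # Prepend the branch bit; later merges add bits closer to the root.
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Toy demo: compress one "cache" tensor group by group.
rng = np.random.default_rng(0)
cache = rng.normal(size=1024).astype(np.float32)
group_size = 128
centroids = np.sort(rng.normal(size=16).astype(np.float32))

all_indices = np.concatenate([
    quantize_group(cache[i:i + group_size], centroids)
    for i in range(0, len(cache), group_size)
])
codes = build_huffman_codes(all_indices)
bitstream = "".join(codes[int(i)] for i in all_indices)
print(f"fp32 bits: {cache.size * 32}, compressed bits: {len(bitstream)}")
```

Note that this sketch only models the compression math; the paper's distinctive contribution, the parallel multi-stage Huffman decoding pipeline that reaches GPU-L2-like throughput, is a hardware design not captured here.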
Similar Papers
Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design
Hardware Architecture
Makes AI smarter using less computer memory.
EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
Machine Learning (CS)
Makes big AI models fit on small devices.
EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training
Machine Learning (CS)
Makes AI learn much faster without losing smartness.