Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression

📅 2025-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address memory bandwidth and capacity bottlenecks in KV cache management during large language model (LLM) inference, this paper proposes an entropy-aware cache compression method. The approach has two components: (1) entropy-driven, group-wise non-uniform quantization that leverages pre-defined shared k-means codebooks to jointly optimize accuracy and compression ratio; and (2) a parallel, multi-stage pipelined Huffman decoding architecture that sharply reduces decompression latency. Experimental evaluation demonstrates that, compared to AWQ, SmoothQuant, and Olive, the proposed method achieves up to 2.9×, 1.9×, and 2.4× higher inference throughput, respectively, while reducing the KV cache memory footprint by approximately 75%, effectively quadrupling cache capacity, without compromising state-of-the-art (SOTA) accuracy.
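The following is a minimal software sketch of the quantization side, assuming a 1-D k-means codebook fitted on calibration data and shared across all groups, with one scale per group. The function names, the 16-level (4-bit) configuration, and the quantile seeding are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_codebook(samples: np.ndarray, n_levels: int = 16,
                   n_iters: int = 25) -> np.ndarray:
    """1-D k-means (Lloyd's algorithm) over calibration samples.

    The n_levels centroids form a non-uniform quantization grid shared
    across all groups, so only a per-group scale plus 4-bit indices
    need to be stored.
    """
    # Seed centroids from evenly spaced quantiles of the data.
    centroids = np.quantile(samples, np.linspace(0.0, 1.0, n_levels))
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every sample.
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for k in range(n_levels):
            if np.any(idx == k):
                centroids[k] = samples[idx == k].mean()
    return np.sort(centroids)

def quantize_group(group: np.ndarray, codebook: np.ndarray):
    """Rescale one group into the codebook's range, then snap every
    value to its nearest codebook entry (non-uniform quantization)."""
    scale = np.max(np.abs(group)) / (np.max(np.abs(codebook)) + 1e-12)
    scale = scale if scale > 0 else 1.0
    indices = np.abs(group[:, None] / scale
                     - codebook[None, :]).argmin(axis=1)
    return scale, indices.astype(np.uint8)

def dequantize_group(scale: float, indices: np.ndarray,
                     codebook: np.ndarray) -> np.ndarray:
    return codebook[indices] * scale

# Toy usage: fit the shared codebook on calibration data, then quantize
# one 128-element group of (stand-in) KV-cache values.
rng = np.random.default_rng(0)
calib = rng.standard_normal(4096)
codebook = build_codebook(calib, n_levels=16)
scale, idx = quantize_group(calib[:128], codebook)
recovered = dequantize_group(scale, idx, codebook)
```

The low-entropy distribution of the resulting indices is what the subsequent Huffman stage exploits.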

📝 Abstract
Large language models (LLMs) have demonstrated transformative capabilities across diverse artificial intelligence applications, yet their deployment is hindered by substantial memory and computational demands, especially in resource-constrained environments. Quantization techniques have emerged as a critical solution, reducing data precision to enhance memory and computational efficiency. However, existing methods often suffer from high runtime overheads and potential accuracy degradation. To address these challenges, we propose Ecco, an entropy-based cache compression technique tailored for LLMs. Ecco combines group-wise and non-uniform quantization with pre-defined shared k-means patterns and Huffman coding to exploit the inherent entropy characteristics of LLM cache data. Recognizing the inefficiencies of traditional Huffman coding in terms of parallelism and latency, we introduce a novel parallel Huffman-based decoding process with a multi-stage pipeline design, reducing latency by two orders of magnitude and achieving throughput comparable to GPU L2 caches. Comprehensive evaluations demonstrate that Ecco achieves up to a 2.9× and 1.9× speedup over the state-of-the-art AWQ and SmoothQuant frameworks and a 2.4× speedup over the Olive accelerator, all while increasing memory capacity by nearly 4× and maintaining state-of-the-art LLM accuracy. These results underscore the effectiveness of our entropy-based cache compression in enhancing LLM performance and efficiency, paving the way for more deployable large-scale AI models.
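To make the parallel-decoding idea concrete, here is a toy software sketch of chunk-parallel Huffman decoding. It assumes chunk-boundary bit offsets are recorded at encode time so independent decoders can start mid-stream; the prefix-free property of the code guarantees each decoder needs no other context. The paper's contribution is a hardware multi-stage pipeline, so everything here (the `encode_chunked` / `parallel_decode` names, the four-symbol code, thread-based parallelism) is a hypothetical analogue, not the paper's design.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy canonical Huffman code over 2-bit symbols (e.g. quantized cache
# indices). A real codebook would be derived from the measured entropy
# of the data; this one is only for illustration.
CODE = {0: "0", 1: "10", 2: "110", 3: "111"}
DECODE = {v: k for k, v in CODE.items()}
MAX_LEN = max(len(v) for v in CODE.values())

def encode_chunked(symbols, chunk_size=64):
    """Huffman-encode `symbols`, recording the bit offset at every
    chunk boundary so decoders can later start mid-stream."""
    bits, offsets, pos = [], [0], 0
    for i, s in enumerate(symbols, start=1):
        bits.append(CODE[s])
        pos += len(CODE[s])
        if i % chunk_size == 0 and i < len(symbols):
            offsets.append(pos)
    return "".join(bits), offsets

def decode_range(bitstream, start_bit, n_symbols):
    """Sequentially decode n_symbols from a known bit offset. Because
    the code is prefix-free, no other context is required."""
    out, pos = [], start_bit
    while len(out) < n_symbols:
        for l in range(1, MAX_LEN + 1):
            sym = DECODE.get(bitstream[pos:pos + l])
            if sym is not None:
                out.append(sym)
                pos += l
                break
    return out

def parallel_decode(bitstream, offsets, total, chunk_size=64):
    """Decode all chunks concurrently: a software stand-in for the
    pipelined hardware decoders that each own one chunk of the stream."""
    counts = [min(chunk_size, total - i * chunk_size)
              for i in range(len(offsets))]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda args: decode_range(bitstream, *args),
                              zip(offsets, counts)))
    return [s for part in parts for s in part]

# Round-trip check on 200 symbols.
data = [(i * 7) % 4 for i in range(200)]
stream, offs = encode_chunked(data)
assert parallel_decode(stream, offs, len(data)) == data
```

The design trade-off this illustrates: storing one offset per chunk costs a little metadata but removes the serial dependency that normally makes Huffman decoding latency-bound, which is why a pipelined multi-decoder arrangement can approach cache-like throughput.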
Problem

Research questions and friction points this paper is trying to address.

Reducing memory bandwidth and capacity demands for LLMs
Overcoming high runtime overheads in quantization techniques
Maintaining accuracy while improving cache compression efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy-based cache compression for LLMs
Parallel Huffman decoding with multi-stage pipeline
Combines group-wise and non-uniform quantization
Feng Cheng
Duke University
Cong Guo
Duke University
Chiyue Wei
Ph.D. student at ECE, Duke University
Computer Architecture, Deep Learning
Junyao Zhang
Duke University
Changchun Zhou
Duke University
AI Chips
Edward Hanson
Advanced Micro Devices, Inc.
Jiaqi Zhang
Advanced Micro Devices, Inc.
Xiaoxiao Liu
Principal Staff Engineer, AMD
Emerging Memory, Heterogeneous System, Neuromorphic Computing, Computer Architecture
Hai Helen Li
Duke University
Yiran Chen
Duke University