🤖 AI Summary
This work addresses the underutilization of low-bit hardware in large language model (LLM) inference: activations typically require higher precision (e.g., 8 bits) than weights, which forces hardware onto wider datapaths. The authors propose SPARQLe, a hardware-software co-design framework that exploits sub-precision redundancy in activations by decomposing each 2k-bit activation into a dense k-bit low-order (LSB) component and a sparse k-bit high-order (MSB) component compressed via a precision bitmap. A lightweight algorithm increases the sparsity of the high-order part. This approach enables efficient computation on k-bit datapaths while preserving 2k-bit activation accuracy. Experiments on BitNet 3B, Llama2 7B, and Llama3 8B show 16–24.3% and 13.5–23.4% reductions in prefill and decode latency, respectively, along with 17–26.7% and 6.5–14.2% reductions in prefill and decode energy.
📝 Abstract
The rapid growth in the size of large language models (LLMs) results in high compute and memory costs during inference. Quantization has been a significant pathway to addressing this challenge. In the quest to push the limits of quantization, weights, which are static, can often be quantized aggressively (e.g., 4 bits), while activations often require higher precision (e.g., 8 bits) to preserve accuracy, forcing hardware to operate with higher-precision datapaths. We leverage the statistical property that a significant fraction of activations are concentrated around zero, resulting in sparsity in the higher-order bits. Our proposal, SPARQLe, is a hardware-software co-design framework that exploits this sub-precision redundancy in any given quantized model. SPARQLe represents each 2k-bit activation tensor as a dense k-bit LSB tensor and a sparse k-bit MSB tensor compressed with a precision bitmap, and proposes a lightweight algorithm to increase MSB sparsity. SPARQLe reduces activation memory traffic and enables efficient computation on k-bit datapaths while preserving 2k-bit activation accuracy. SPARQLe includes an accelerator that operates directly on this hybrid format with minimal control overheads. Across the BitNet 3B, Llama2 7B, and Llama3 8B models, SPARQLe reduces prefill latency by 16-24.3% and decode latency by 13.5-23.4%, with 17-26.7% and 6.5-14.2% lower prefill and decode energy, respectively. SPARQLe demonstrates that sub-precision activation sparsity offers an effective and complementary pathway towards efficient LLM inference.
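The hybrid format described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`split_activations`, `merge_activations`, `dot_hybrid`) are hypothetical, activations are assumed to be unsigned 2k-bit integer codes (the abstract does not say how signed values are handled), and the bitmap is stored as a plain list of 0/1 flags rather than in any particular hardware layout. The sketch relies only on the identity a = 2^k · a_MSB + a_LSB, so a dot product with 2k-bit activations splits into a dense pass over the k-bit LSBs plus a sparse pass over the nonzero k-bit MSBs.

```python
from typing import List, Tuple


def split_activations(acts: List[int], k: int = 4) -> Tuple[List[int], List[int], List[int]]:
    """Split unsigned 2k-bit codes into dense LSBs, a bitmap, and packed nonzero MSBs.

    Assumption: each activation is an integer in [0, 2**(2k)).
    """
    mask = (1 << k) - 1
    lsb, bitmap, msb_nonzero = [], [], []
    for a in acts:
        if not 0 <= a < (1 << (2 * k)):
            raise ValueError(f"{a} does not fit in {2 * k} unsigned bits")
        lo, hi = a & mask, a >> k
        lsb.append(lo)                  # dense k-bit low-order part
        bitmap.append(1 if hi else 0)   # 1 marks a nonzero high-order part
        if hi:
            msb_nonzero.append(hi)      # only nonzero MSBs are stored
    return lsb, bitmap, msb_nonzero


def merge_activations(lsb: List[int], bitmap: List[int], msb_nonzero: List[int], k: int = 4) -> List[int]:
    """Reconstruct the original 2k-bit codes from the hybrid format."""
    packed = iter(msb_nonzero)
    return [((next(packed) << k) if flag else 0) | lo for lo, flag in zip(lsb, bitmap)]


def dot_hybrid(weights: List[int], lsb: List[int], bitmap: List[int],
               msb_nonzero: List[int], k: int = 4) -> int:
    """Dot product using only k-bit activation operands.

    Dense pass over the LSBs, then a sparse pass that visits only the
    positions whose bitmap flag is set, scaled by 2**k.
    """
    dense = sum(w * lo for w, lo in zip(weights, lsb))
    packed = iter(msb_nonzero)
    sparse = sum(w * next(packed) for w, flag in zip(weights, bitmap) if flag)
    return dense + sparse * (1 << k)


if __name__ == "__main__":
    acts = [3, 18, 0, 255, 7, 16]       # illustrative 8-bit codes (k = 4)
    weights = [1, -2, 3, 4, -5, 6]      # illustrative integer weights
    lsb, bitmap, msb = split_activations(acts, k=4)
    assert merge_activations(lsb, bitmap, msb, k=4) == acts
    assert dot_hybrid(weights, lsb, bitmap, msb, k=4) == sum(w * a for w, a in zip(weights, acts))
```

In this toy input, half of the activations are below 2^k, so the sparse pass touches three of six positions; the more activations cluster near zero, the fewer MSB entries need to be stored or multiplied.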