🤖 AI Summary
This work addresses the efficiency bottlenecks that large language models (LLMs) face with asymmetric GEMM inputs, small batch sizes, and nonlinear operations by proposing a value-centric Value-Level Parallelism (VLP) strategy. Combining nonlinear approximation with weight-only quantization, KV cache quantization, and grouped-query attention, the authors design Mugi, a specialized hardware architecture that efficiently supports asymmetric GEMM and small-batch inference. Their approach achieves up to 45× higher throughput and 668× better energy efficiency on softmax operations. On full LLM inference, Mugi delivers 2.07× higher throughput and 3.11× better energy efficiency, while reducing operational and embodied carbon emissions by 1.45× and 1.48×, respectively.
📝 Abstract
Value-level parallelism (VLP) has been proposed to improve the efficiency of large-batch, low-precision general matrix multiply (GEMM) between symmetric activations and weights. Transformer-based large language models (LLMs), however, involve operations more sophisticated than activation-weight GEMM. In this paper, we explore how VLP benefits LLMs. First, we generalize VLP to nonlinear approximations, outperforming existing nonlinear approximations in end-to-end LLM accuracy, performance, and efficiency. Our VLP approximation follows a value-centric approach, in which important values are assigned greater precision. Second, we optimize VLP for efficient small-batch GEMMs with asymmetric inputs, leveraging recent LLM optimizations including weight-only quantization, key-value (KV) cache quantization, and grouped-query attention. Finally, we design a new VLP architecture, Mugi, that encapsulates the innovations above and supports full LLM workloads, while providing better performance, efficiency, and sustainability. Our experimental results show that Mugi offers significant improvements in throughput and energy efficiency, up to $45\times$ and $668\times$ for nonlinear softmax operations and $2.07\times$ and $3.11\times$ for end-to-end LLMs, and also decreases operational carbon for LLM operation by $1.45\times$ and embodied carbon by $1.48\times$.
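To make the value-centric idea concrete, here is a minimal, hypothetical sketch of a softmax approximation in which the "important" values (those near the maximum, which dominate the output after exponentiation) are quantized at higher precision than the rest. The function names, bit widths, and importance threshold are illustrative assumptions, not the paper's actual VLP scheme or the Mugi hardware datapath:

```python
import numpy as np

def quantize(v, bits):
    # Hypothetical uniform quantizer for values in [0, 1]
    # using 2**bits - 1 levels.
    levels = (1 << bits) - 1
    return np.round(v * levels) / levels

def value_centric_softmax(x, hi_bits=8, lo_bits=3, window=4.0):
    # Illustrative value-centric approximation (not the paper's design):
    # shift so the max is 0, putting exp(x - max) in (0, 1].
    z = x - np.max(x)
    e = np.exp(z)
    # Values within `window` of the max dominate the softmax output,
    # so they receive the higher-precision quantization.
    important = z > -window
    e_q = np.where(important, quantize(e, hi_bits), quantize(e, lo_bits))
    return e_q / np.sum(e_q)
```

Because softmax outputs are dominated by the largest logits, spending precision only on those entries keeps end-to-end error small while the long tail of unimportant values can be represented very coarsely.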