🤖 AI Summary
This work addresses the memory capacity, bandwidth, and interconnect bottlenecks in large-scale AI agent inference caused by long-context key-value (KV) caching, which conventional performance models fail to accurately capture. To systematically analyze non-compute bottlenecks under agent workloads, the authors propose a joint metric combining operational intensity (OI) and capacity footprint (CF). Guided by this analysis, they design a compute-memory decoupled architecture tailored for heterogeneous systems, integrating agent-hardware co-design, dedicated prefill/decode accelerators, and optical-interconnect-enabled memory expansion to effectively mitigate the memory wall. Experiments across representative techniques—including grouped-query attention (GQA), multi-head latent attention (MLA), mixture-of-experts (MoE), and quantization—demonstrate significant improvements in scalability and energy efficiency.
📝 Abstract
AI agent inference is driving an inference-heavy datacenter future and exposes bottlenecks beyond compute, especially memory capacity, memory bandwidth, and high-speed interconnect. We introduce two metrics, Operational Intensity (OI) and Capacity Footprint (CF), that jointly explain regimes classic roofline analysis misses, including the memory capacity wall. Across agentic workflows (chat, coding, web use, computer use) and base-model choices (GQA/MLA, MoE, quantization), OI/CF can shift dramatically, with long-context KV caches making decode highly memory-bound. These observations motivate disaggregated serving and system-level heterogeneity: specialized prefill and decode accelerators, broader scale-up networking, and decoupled compute-memory enabled by optical I/O. We further hypothesize agent-hardware co-design, multiple inference accelerators within one system, and high-bandwidth, large-capacity memory disaggregation as foundations for adapting to evolving OI/CF. Together, these directions chart a path to sustained efficiency and capability for large-scale agentic AI inference.
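To make the OI/CF intuition concrete, here is a minimal back-of-the-envelope sketch of why long-context decode becomes memory-bound: each decode step streams the entire KV cache from memory while performing only a few FLOPs per byte read. The function names, the simplified FLOP/byte accounting, and the example GQA configuration below are illustrative assumptions for this sketch, not the paper's exact model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2, batch=1):
    # Capacity footprint (CF) of the KV cache: K and V tensors per layer,
    # each shaped [batch, kv_heads, seq_len, head_dim].
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes * batch

def decode_attention_oi(q_heads, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Operational intensity (OI, FLOPs/byte) of one decode step of attention
    # in a single layer. Each query head computes QK^T and the attention-
    # weighted sum over V: 2 matmul-like passes of 2*seq_len*head_dim FLOPs.
    flops = q_heads * 2 * (2 * seq_len * head_dim)
    # Bytes moved: the layer's full K and V caches are read from memory.
    bytes_moved = 2 * kv_heads * seq_len * head_dim * dtype_bytes
    return flops / bytes_moved

# Illustrative GQA-style configuration (64 query heads sharing 8 KV heads):
cf_gb = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                       seq_len=128_000, dtype_bytes=2) / 1e9
oi = decode_attention_oi(q_heads=64, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"KV cache footprint: {cf_gb:.1f} GB, decode attention OI: {oi:.1f} FLOPs/byte")
```

Even with GQA's 8x KV-head sharing, the resulting OI of a few FLOPs per byte sits far below the hundreds of FLOPs per byte that modern accelerators need to stay compute-bound, which is the regime the abstract describes as the memory wall for decode.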