ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

📅 2026-06-10

🤖 AI Summary

This work addresses terabyte-scale context state in large language model (LLM) inference, which exceeds the memory capacity of a single server and has pushed the industry toward disaggregated shared storage. Existing shared context layers that offload NVMe-over-Fabrics (NVMe-oF) target processing to a DPU still require sophisticated software-level optimization and carry cost-efficiency burdens. The authors propose ITME, which uses Compute Express Link (CXL)-hybrid memory to present a byte-addressable, terabyte-scale remote memory expansion. Because model weights and prefix caches are accessed in deterministic patterns, the system can proactively manage data movement across the memory-storage hierarchy. Performance potential is evaluated with production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, and functional feasibility is demonstrated with an FPGA-based hardware prototype. By adding remote memory expansion for KV cache footprints that exceed host memory limits, ITME achieves up to 35.7% higher throughput than conventional CPU offloading.

📝 Abstract

The rapid shift toward agentic and long-context workloads in Large Language Models (LLMs) is pushing the industry beyond the capacity of individual servers toward disaggregated shared storage to handle TB-scale context states. This movement has led to the emergence of specialized shared context layers designed to externalize and share cumulative inference states across distributed clusters. While offloading to a data processing unit (DPU) within just-a-bunch-of-flash (JBOF) architectures accelerates NVMe-over-Fabrics (NVMe-oF) target processing, the burdens of sophisticated software-level optimization and cost-efficiency remain significant. Consequently, the ideal architecture for scaling this shared context infrastructure is still an active area of exploration. In this paper, we propose ITME (Inference Tiered Memory Expansion), which leverages CXL-hybrid memory to present a massive, TB-scale, byte-addressable remote memory expansion. This approach enables cost-efficient scaling and simplifies the software stack through direct byte-addressability, effectively addressing the challenges of shared context infrastructure. Our key insight is that the deterministic access patterns of voluminous model weights and prefix caches enable the system to proactively manage data movement across the memory-storage hierarchy. We validate ITME by evaluating its performance potential with production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, while further demonstrating its functional feasibility through an FPGA-based hardware prototype. Overall, ITME enhances conventional CPU-offloading by providing additional remote memory expansion to accommodate large KV cache footprints beyond host memory limits, achieving up to a 35.7% throughput improvement.
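
The key insight in the abstract, that a deterministic access order lets a system stage data before it is needed, can be illustrated with a toy sketch. This is not ITME's mechanism: the class `TieredStore`, the function `run`, the LRU eviction policy, and the `lookahead` parameter are all hypothetical names and choices made for illustration, and a plain Python dict stands in for the slow remote tier.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast tier in front of a large slow tier.

    Hypothetical illustration only; it does not model CXL, NVMe, or ITME.
    """

    def __init__(self, fast_capacity, slow_tier):
        self.fast_capacity = fast_capacity
        self.slow = slow_tier          # block id -> payload (stand-in for remote memory)
        self.fast = OrderedDict()      # block id -> payload, oldest entry first (LRU)
        self.hits = 0
        self.misses = 0

    def _install(self, block_id):
        """Bring a block into the fast tier, evicting the oldest if full."""
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return
        while len(self.fast) >= self.fast_capacity:
            self.fast.popitem(last=False)
        self.fast[block_id] = self.slow[block_id]

    def prefetch(self, block_id):
        """Stage a block ahead of use; not counted as a hit or a miss."""
        self._install(block_id)

    def read(self, block_id):
        """Demand access: count a hit if the block is already staged."""
        if block_id in self.fast:
            self.hits += 1
            self.fast.move_to_end(block_id)
        else:
            self.misses += 1
            self._install(block_id)
        return self.fast[block_id]


def run(schedule, fast_capacity, lookahead):
    """Replay a known access schedule, prefetching `lookahead` blocks ahead."""
    slow = {b: f"payload-{b}" for b in set(schedule)}
    store = TieredStore(fast_capacity, slow)
    for i, block_id in enumerate(schedule):
        store.read(block_id)
        # The schedule is known in advance, so upcoming blocks can be staged now.
        for upcoming in schedule[i + 1 : i + 1 + lookahead]:
            store.prefetch(upcoming)
    return store.hits, store.misses
```

As a usage example, take a schedule that sweeps eight blocks in order three times, `schedule = list(range(8)) * 3`, with a fast tier holding four blocks. Reactive LRU alone (`run(schedule, 4, 0)`) misses on all 24 accesses, because each block is evicted before it is reused. With a lookahead of two (`run(schedule, 4, 2)`), only the very first access misses and the remaining 23 hit. The toy ignores transfer latency and bandwidth, which a real system would have to hide behind computation.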

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

shared context infrastructure

memory expansion

disaggregated memory

CXL-hybrid memories

Innovation

Methods, ideas, or system contributions that make the work stand out.

CXL-hybrid memory

byte-addressable remote memory

tiered memory expansion