🤖 AI Summary
To address the lack of high-fidelity, verifiable system-level simulation tools for CXL-based heterogeneous memory systems, this paper introduces CXL-DMSim, the first open-source, full-system CXL memory-disaggregation simulator running at near-gem5 simulation speed. Its key contributions are: (1) a novel CXL disaggregation simulation framework supporting NUMA-aware kernel memory management; (2) a self-developed, integrated CXL.io/CXL.mem protocol stack, device driver, and flexible memory-expander model; and (3) dual-mode operation (app-managed and kernel-managed) with fine-grained runtime observability. Validated against FPGA- and ASIC-based hardware prototypes, CXL-DMSim achieves a mean simulation error of only 3.4%. Experimental results show that CXL-ASIC memory latency is about 2.18× that of local DDR, while bandwidth reaches 82–83% of local DDR. Memory-intensive applications benefit substantially: Viper improves by up to 23× and MERCI by approximately 60%.
📝 Abstract
Compute eXpress Link (CXL) has emerged as a key enabler of memory disaggregation for future heterogeneous computing systems, allowing memory to be expanded on demand and improving resource utilization. However, CXL is still in its infancy and lacks commodity products on the market, necessitating a reliable system-level simulation tool for research and development. In this paper, we propose CXL-DMSim, an open-source full-system simulator that simulates CXL disaggregated memory systems with high fidelity at a gem5-comparable simulation speed. CXL-DMSim incorporates a flexible CXL memory expander model along with its associated device driver, and supports the CXL protocol with CXL.io and CXL.mem. It can operate in both app-managed mode and kernel-managed mode, the latter using a dedicated NUMA-compatible mechanism. The simulator has been rigorously verified against a real hardware testbed with both FPGA- and ASIC-based CXL memory devices, demonstrating its fidelity in reproducing the characteristics of various CXL memory devices with an average simulation error of 3.4%. Experimental results using the LMbench and STREAM benchmarks suggest that CXL-FPGA memory exhibits ~2.88x the latency of local DDR while CXL-ASIC exhibits ~2.18x; CXL-FPGA achieves 45-69% of local DDR memory bandwidth, whereas CXL-ASIC achieves 82-83%. The study also reveals that CXL memory can significantly enhance the performance of memory-intensive applications: by up to 23x for the Viper key-value database under limited local memory, and by approximately 60% in memory-bandwidth-sensitive scenarios such as MERCI. Moreover, the simulator's observability and expandability are showcased through detailed case studies, highlighting its great potential for research on future CXL-interconnected hybrid memory pools.
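The latency and bandwidth ratios above come from LMbench- and STREAM-style microbenchmarks. As a rough, hedged illustration of how such measurements are typically made (this is not code from the paper; the function names and parameters are hypothetical), a pointer-chasing loop approximates LMbench's `lat_mem_rd` dependent-load latency, and a triad loop approximates STREAM bandwidth:

```python
import random
import time

def pointer_chase_ns(n, iters):
    """Approximate dependent-load latency (LMbench lat_mem_rd style):
    walk a random cyclic permutation so each load depends on the
    previous result, defeating hardware prefetchers."""
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    p = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        p = nxt[p]  # each access depends on the previous one
    dt = time.perf_counter() - t0
    return dt / iters * 1e9  # ns per access (includes interpreter overhead)

def stream_triad_gbps(n, scalar=3.0):
    """Approximate STREAM 'triad' bandwidth: a[i] = b[i] + scalar * c[i].
    Assumes 8 bytes per element as in the C STREAM benchmark; Python's
    boxed floats move more data, so treat the figure as a sketch."""
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n
    t0 = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    dt = time.perf_counter() - t0
    return (3 * 8 * n) / dt / 1e9  # GB/s over three streamed arrays
```

In kernel-managed mode the CXL expander appears as a CPU-less NUMA node, so comparisons like 2.18x latency or 82-83% bandwidth can be obtained by running the same kernel once with memory bound to the local DDR node and once to the CXL node (e.g. via `numactl --membind`).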