🤖 AI Summary
To address intra-chiplet memory access imbalance and inefficient task scheduling caused by partitioned L3 caches in chiplet-based CPUs, this paper proposes a lightweight adaptive runtime system that jointly optimizes task scheduling, memory allocation, and performance monitoring. Our approach introduces a chiplet-aware fine-grained task migration mechanism and a hardware-topology-aware memory allocation strategy—overcoming the limitations of conventional NUMA optimizations in chiplet architectures. It integrates chiplet-aware heuristic scheduling, a user-space lightweight concurrency model (supporting suspension/resumption and cross-chiplet task migration), and real-time performance monitoring. Experimental evaluation across diverse memory-intensive parallel applications demonstrates an average 1.7× speedup, a 22% improvement in L3 cache hit rate, and a 35% reduction in cross-chiplet memory access latency.
📝 Abstract
The growing disparity between CPU core counts and available memory bandwidth has intensified memory contention in servers. This particularly affects highly parallelizable applications, which must achieve efficient cache utilization to maintain performance as CPU core counts grow. Optimizing cache utilization, however, is complex for recent chiplet-based CPUs, whose partitioned L3 caches lead to varying latencies and bandwidths, even within a single NUMA domain. Classical NUMA optimizations and task scheduling approaches unfortunately fail to address the performance issues of chiplet-based CPUs. We describe Adaptive Runtime system for Chiplet-Aware Scheduling (ARCAS), a new runtime system designed for chiplet-based CPUs. ARCAS combines chiplet-aware task scheduling heuristics, hardware-aware memory allocation, and fine-grained performance monitoring to optimize workload execution. It implements a lightweight concurrency model that combines user-level thread features-such as individual stacks, per-task scheduling, and state management-with coroutine-like behavior, allowing tasks to suspend and resume execution at defined points while efficiently managing task migration across chiplets. Our evaluation across diverse scenarios shows ARCAS's effectiveness for optimizing the performance of memory-intensive parallel applications.