AI Summary
In multithreaded, multicore systems, memory allocators suffer from cache pollution and cross-core synchronization overhead because allocator metadata is interleaved with user data, which contributes to performance variation of up to 2.7× depending on the allocator used. Existing hardware acceleration approaches (e.g., Mallacc, Memento) are hindered by limited multithreading support and by core-accelerator synchronization bottlenecks. This paper proposes SpeedMalloc, a memory allocation architecture built around a lightweight, programmable support-core: allocator metadata is kept in the support-core's own caches, allocation logic is offloaded to that core, and an efficient cross-core data synchronization mechanism connects it to the application cores. Because the support-core is general-purpose rather than a fixed-function accelerator, it can adopt new allocator designs. On multithreaded workloads, SpeedMalloc achieves speedups of 1.75× over Jemalloc, 1.18× over TCMalloc, 1.15× over Mimalloc, 1.23× over Mallacc, and 1.18× over Memento.
Abstract
Memory allocation, though it constitutes only a small portion of the executed code, can have a "butterfly effect" on overall program performance. Although it accounts for only about 5% of total instructions, memory allocation can cause up to a 2.7x performance variation depending on the allocator used. This effect arises from the complexity of memory allocation in modern multi-threaded, multi-core systems, where allocator metadata becomes intertwined with user data, leading to cache pollution or increased cross-thread synchronization overhead. Offloading memory allocators to accelerators, e.g., Mallacc and Memento, is a potential direction for improving allocator performance and mitigating cache pollution. However, these accelerators currently have limited support for multi-threaded applications, and synchronization between cores and accelerators remains a significant challenge.
We present SpeedMalloc, which uses a lightweight support-core to process memory allocation tasks in multi-threaded applications. The support-core is a lightweight programmable processor with efficient cross-core data synchronization, and it houses all allocator metadata in its own caches. This design minimizes cache conflicts with user data and eliminates the need for cross-core metadata synchronization. In addition, using a general-purpose core instead of a domain-specific accelerator lets SpeedMalloc adopt new allocator designs. We compare SpeedMalloc with state-of-the-art software and hardware allocators, including Jemalloc, TCMalloc, Mimalloc, Mallacc, and Memento. SpeedMalloc achieves 1.75x, 1.18x, 1.15x, 1.23x, and 1.18x speedups on multithreaded workloads over these five allocators, respectively.
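The offloading idea can be illustrated with a loose software analogy. The sketch below is not SpeedMalloc's implementation, which is a hardware design. It assumes a single Python worker thread standing in for the support-core, a `queue.Queue` standing in for the cross-core channel, and a toy power-of-two size-class allocator. The names `SupportCoreAllocator`, `_size_class`, and `demo` are hypothetical. It shows only the ownership pattern: application threads never touch allocator metadata, so the metadata needs no locks.

```python
import queue
import threading
from concurrent.futures import Future


class SupportCoreAllocator:
    """Toy analogy: one worker thread owns all allocator metadata."""

    def __init__(self, heap_size=1 << 20):
        self._requests = queue.Queue()   # stand-in for the cross-core channel
        self._heap_size = heap_size
        # Everything below is touched only by the worker thread (no locks).
        self._bump = 0                   # next never-used offset in the heap
        self._free_lists = {}            # size class -> list of free offsets
        self._live = {}                  # live offset -> size class
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    @staticmethod
    def _size_class(n):
        # Round up to a power of two, with a 16-byte minimum (an assumption).
        return max(16, 1 << (n - 1).bit_length())

    def _serve(self):
        # Worker loop: the only code that reads or writes the metadata.
        while True:
            op, arg, reply = self._requests.get()
            if op == "stop":
                reply.set_result(None)
                return
            try:
                handler = self._do_malloc if op == "malloc" else self._do_free
                reply.set_result(handler(arg))
            except Exception as exc:
                reply.set_exception(exc)

    def _do_malloc(self, n):
        cls = self._size_class(n)
        free = self._free_lists.setdefault(cls, [])
        if free:
            off = free.pop()             # reuse a previously freed block
        else:
            if self._bump + cls > self._heap_size:
                raise MemoryError("simulated heap exhausted")
            off = self._bump             # carve a fresh block
            self._bump += cls
        self._live[off] = cls
        return off

    def _do_free(self, off):
        cls = self._live.pop(off)        # KeyError on an invalid or double free
        self._free_lists[cls].append(off)

    def _call(self, op, arg=None):
        # Application side: send a request, then block until the reply arrives.
        reply = Future()
        self._requests.put((op, arg, reply))
        return reply.result()

    def malloc(self, n):
        return self._call("malloc", n)

    def free(self, off):
        self._call("free", off)

    def stop(self):
        self._call("stop")
        self._worker.join()


def demo():
    """Four application threads allocate concurrently; one block is recycled."""
    alloc = SupportCoreAllocator()
    results = [[] for _ in range(4)]

    def app_thread(out):
        for _ in range(100):
            out.append(alloc.malloc(24))

    threads = [threading.Thread(target=app_thread, args=(r,)) for r in results]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    offsets = [off for r in results for off in r]
    alloc.free(offsets[0])
    reused = alloc.malloc(20)            # same 32-byte class as the freed block
    alloc.stop()
    return offsets, reused
```

In this analogy every request pays a queue round trip, which is the kind of cost the abstract says efficient cross-core data synchronization is meant to address. The sketch makes no attempt to model that mechanism or the cache behavior.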