SpeedMalloc: Improving Multi-threaded Applications via a Lightweight Core for Memory Allocation

📅 2025-08-27

🤖 AI Summary

In multithreaded, multicore systems, allocator metadata becomes interleaved with user data, which causes cache pollution and cross-thread synchronization overhead. Although allocation accounts for only about 5% of executed instructions, the choice of allocator alone can change performance by up to 2.7×. Existing hardware accelerators (e.g., Mallacc, Memento) offer limited support for multithreaded applications, and synchronization between cores and the accelerator remains a bottleneck. This paper proposes SpeedMalloc, which offloads allocation to a lightweight, programmable support core: all allocator metadata lives in the support core's own caches, and an efficient cross-core data synchronization mechanism connects it to the application cores. Because the support core is general-purpose rather than a fixed-function accelerator, it can adopt new allocator designs. On multithreaded workloads, SpeedMalloc achieves speedups of 1.75×, 1.18×, 1.15×, 1.23×, and 1.18× over Jemalloc, TCMalloc, Mimalloc, Mallacc, and Memento, respectively.

📝 Abstract

Memory allocation, though constituting only a small portion of the executed code, can have a "butterfly effect" on overall program performance, leading to significant and far-reaching impacts. Despite accounting for just approximately 5% of total instructions, memory allocation can result in up to a 2.7x performance variation depending on the allocator used. This effect arises from the complexity of memory allocation in modern multi-threaded multi-core systems, where allocator metadata becomes intertwined with user data, leading to cache pollution or increased cross-thread synchronization overhead. Offloading memory allocators to accelerators, e.g., Mallacc and Memento, is a potential direction to improve the allocator performance and mitigate cache pollution. However, these accelerators currently have limited support for multi-threaded applications, and synchronization between cores and accelerators remains a significant challenge. We present SpeedMalloc, using a lightweight support-core to process memory allocation tasks in multi-threaded applications. The support-core is a lightweight programmable processor with efficient cross-core data synchronization and houses all allocator metadata in its own caches. This design minimizes cache conflicts with user data and eliminates the need for cross-core metadata synchronization. In addition, using a general-purpose core instead of domain-specific accelerators makes SpeedMalloc capable of adopting new allocator designs. We compare SpeedMalloc with state-of-the-art software and hardware allocators, including Jemalloc, TCMalloc, Mimalloc, Mallacc, and Memento. SpeedMalloc achieves 1.75x, 1.18x, 1.15x, 1.23x, and 1.18x speedups on multithreaded workloads over these five allocators, respectively.
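The ownership idea in the abstract, where one dedicated core holds all allocator metadata and the application cores only send requests to it, can be illustrated with a loose software analogy. The sketch below is not SpeedMalloc's implementation: the class name `SupportCoreAllocator`, the fixed-size block pool, and the use of a Python thread and queues in place of a hardware support core and its synchronization mechanism are all assumptions made for illustration.

```python
import queue
import threading


class SupportCoreAllocator:
    """Toy model (hypothetical): a single worker thread plays the role of
    the support core and is the only code that touches allocator metadata."""

    def __init__(self, heap_size, block_size):
        self._requests = queue.Queue()
        # Metadata: free list and live set. Only the worker thread reads or
        # writes these, so they need no lock of their own.
        self._free = list(range(0, heap_size, block_size))
        self._live = set()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def _serve(self):
        # Worker loop: handle one request at a time, in arrival order.
        while True:
            op, arg, reply = self._requests.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "malloc":
                addr = self._free.pop() if self._free else None
                if addr is not None:
                    self._live.add(addr)
                reply.put(addr)
            elif op == "free":
                if arg in self._live:
                    self._live.remove(arg)
                    self._free.append(arg)
                    reply.put(True)
                else:
                    # Unknown or already-freed address: reject it.
                    reply.put(False)

    def _call(self, op, arg=None):
        # Application side: post a request, then block until the reply.
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, arg, reply))
        return reply.get()

    def malloc(self):
        return self._call("malloc")

    def free(self, addr):
        return self._call("free", addr)

    def stop(self):
        self._call("stop")
        self._worker.join()
```

Any number of application threads can call `malloc()` and `free()` concurrently. Because requests are serialized through the worker, no two callers receive the same block and the free list is never shared between threads. The sketch says nothing about performance; in the paper the benefit comes from the hardware support core and its caches, which this analogy does not model.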

Problem

Research questions and friction points this paper is trying to address.

Optimizing memory allocation performance in multi-threaded applications

Reducing cache pollution caused by allocator metadata interference

Minimizing cross-thread synchronization overhead in memory operations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight core for memory allocation tasks

Efficient cross-core data synchronization design

General-purpose core avoids domain-specific limitations
