AI Summary
HPX suffers from suboptimal C++ Executor performance on heterogeneous hardware due to static resource allocation. To address this, we propose a cores-aware and chunking-aware adaptive executor model that dynamically monitors runtime load, models scheduling overhead, and heuristically adjusts task chunking and core binding, enabling online optimization for both compute- and memory-bound workloads within HPX. Our design fully conforms to the standard C++20 Executor interface and requires no modifications to user code. Experimental evaluation across diverse hardware configurations and representative parallel workloads demonstrates speedups of 1.4–2.3× over baseline static strategies, confirming substantial performance gains while preserving portability and standards compliance.
Abstract
C++ Executors simplify the development of parallel algorithms by abstracting concurrency management across hardware architectures. They are designed to facilitate portability and uniformity of user-facing interfaces; however, in some cases they may lead to performance inefficiencies due to suboptimal resource allocation for a particular workload or failure to leverage certain hardware-specific capabilities. To mitigate these inefficiencies, we have developed a strategy based on cores and chunking (workload partitioning) and integrated it into HPX's executor API. This strategy dynamically optimizes workload distribution and resource allocation based on runtime metrics and overheads. In this paper, we introduce the model behind this strategy and evaluate its efficiency by testing its implementation (as an HPX executor) on both compute-bound and memory-bound workloads. The results show speedups across all tests, configurations, and workloads studied, offering improved performance through a familiar and user-friendly C++ executor API.