AI Summary
In SMT-enabled ARM servers, thread co-location induces inter-application interference, yet existing Thread-to-Core (T2C) allocation strategies are constrained by the limited capabilities of ARM's Performance Monitoring Unit (PMU) counters, hindering accurate performance stack construction and degrading prediction model fidelity.
Method: We propose ISC (Instructions and Stalls Cycles), a novel performance stack that overcomes ARM hardware monitoring bottlenecks, and build SYNPA, a lightweight, cross-platform, machine learning-based T2C allocation framework leveraging ISC.
Contribution/Results: Evaluated on multi-application workloads, SYNPA4 reduces average job turnaround time by 38% compared to Linux's default scheduler, three times the improvement achieved by the state-of-the-art ARM-specific T2C strategy. Crucially, SYNPA demonstrates vendor-agnostic compatibility across diverse SMT-capable ARM processors, enabling practical deployment without hardware-specific tuning.
Abstract
Modern high-performance servers commonly integrate Simultaneous Multithreading (SMT) processors, which efficiently boost throughput over single-threaded cores. Optimizing performance in SMT processors is challenging because of the inter-application interference within each SMT core. To mitigate this interference, thread-to-core (T2C) allocation policies play a pivotal role. State-of-the-art T2C policies work in two steps: i) building a per-application performance stack using performance counters and ii) building performance prediction models to identify the best pairs of applications to run on each core.
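The two-step structure described above can be illustrated with a minimal Python sketch. It is not the SYNPA implementation: the stack categories (`frontend`, `backend`), the counter values, and the `overlap_cost` function are all hypothetical stand-ins, with `overlap_cost` taking the place of a trained prediction model. Step i normalizes raw stall-cycle counts into a per-application stack; step ii exhaustively searches all ways of pairing applications on 2-way SMT cores and keeps the pairing with the lowest total predicted interference.

```python
def build_stack(cycles, stall_cycles):
    """Step i (toy version): express each stall category as a fraction of total cycles."""
    return {name: count / cycles for name, count in stall_cycles.items()}


def pairings(apps):
    """Yield every way to split an even-sized list of apps into unordered pairs."""
    if not apps:
        yield []
        return
    first, rest = apps[0], apps[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def overlap_cost(stack_a, stack_b):
    """Hypothetical interference estimate: apps stalling on the same resource clash more.

    A real policy would use a learned prediction model here.
    """
    return sum(stack_a[k] * stack_b[k] for k in stack_a)


def best_allocation(stacks):
    """Step ii: pick the pairing with the lowest total predicted interference."""
    apps = sorted(stacks)
    return min(
        pairings(apps),
        key=lambda pairs: sum(overlap_cost(stacks[a], stacks[b]) for a, b in pairs),
    )


# Made-up counter readings for four applications (not measured data).
stacks = {
    "A": build_stack(1000, {"frontend": 600, "backend": 100}),
    "B": build_stack(1000, {"frontend": 500, "backend": 200}),
    "C": build_stack(1000, {"frontend": 100, "backend": 700}),
    "D": build_stack(1000, {"frontend": 200, "backend": 600}),
}
allocation = best_allocation(stacks)
print(allocation)
```

With these made-up numbers, the search co-locates each frontend-bound application with a backend-bound one. Exhaustive enumeration is only practical for a handful of applications; the number of pairings grows very quickly with the application count, so a real policy would need a cheaper matching strategy.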
This paper explores distinct ways to build the performance stack in ARM processors and introduces the Instructions and Stalls Cycles (ISC) stack, a novel approach to overcome ARM PMU limitations. The ISC stacks are used as inputs for a performance prediction model to estimate the applications' performance considering the inter-application interference. The accuracy of the prediction model (second step) depends on the accuracy of the performance stack (first step); thus, the higher the accuracy of the performance stack, the higher the potential performance gains obtained by the T2C allocation policy.
This paper presents SYNPA as a family of T2C allocation policies. Experimental results show that SYNPA4, the best-performing SYNPA variant, improves turnaround time by 38% over Linux, which represents 3$\times$ the gains achieved by the state-of-the-art policies for ARM processors. Furthermore, the discussions and refinements presented throughout this paper can be applied to other SMT processors from distinct vendors and are aimed at helping performance analysts build performance stacks for accurate performance estimates in real processors.