A New Family of Thread to Core Allocation Policies for an SMT ARM Processor

📅 2025-07-01

🤖 AI Summary

In SMT-enabled ARM servers, co-locating threads on the same core causes inter-application interference, yet existing thread-to-core (T2C) allocation policies are constrained by the limitations of ARM's Performance Monitoring Unit (PMU), which hinder accurate performance stack construction and, in turn, degrade the accuracy of the prediction model. Method: The paper proposes the Instructions and Stalls Cycles (ISC) stack, a performance stack designed to work around ARM PMU limitations, and builds SYNPA, a family of T2C allocation policies that feed ISC stacks into a performance prediction model. Contribution/Results: SYNPA4, the best-performing variant, improves turnaround time by 38% over Linux, which is three times the gain achieved by state-of-the-art T2C policies for ARM processors. The discussions and refinements on building performance stacks are also applicable to SMT processors from other vendors.

📝 Abstract

Modern high-performance servers commonly integrate Simultaneous Multithreading (SMT) processors, which efficiently boost throughput over single-threaded cores. Optimizing performance in SMT processors is challenging because of the inter-application interference within each SMT core. To mitigate this interference, thread-to-core (T2C) allocation policies play a pivotal role. State-of-the-art T2C policies work in two steps: i) building a per-application performance stack using performance counters and ii) building performance prediction models to identify the best pairs of applications to run on each core. This paper explores distinct ways to build the performance stack in ARM processors and introduces the Instructions and Stalls Cycles (ISC) stack, a novel approach to overcome ARM PMU limitations. The ISC stacks are used as inputs for a performance prediction model that estimates each application's performance while accounting for inter-application interference. The accuracy of the prediction model (second step) depends on the accuracy of the performance stack (first step); thus, the more accurate the performance stack, the higher the potential performance gains obtained by the T2C allocation policy. This paper presents SYNPA, a family of T2C allocation policies. Experimental results show that SYNPA4, the best-performing SYNPA variant, improves turnaround time by 38% over Linux, which represents $3\times$ the gains achieved by the state-of-the-art policies for ARM processors. Furthermore, the discussions and refinements presented throughout this paper can be applied to other SMT processors from distinct vendors and are aimed at helping performance analysts build performance stacks for accurate performance estimates in real processors.
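
The two-step structure described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's method: the three stack categories (`work`, `frontend`, `backend`), the counter names, the dot-product interference model in `predicted_slowdown`, and the brute-force search in `best_pairing` are all hypothetical stand-ins. The abstract does not specify the composition of the ISC stack or the form of the prediction model.

```python
def build_stack(counters):
    """Step i: turn raw cycle counts into a stack of cycle fractions.

    The counter names and the three categories are hypothetical.
    """
    total = counters["cycles"]
    frontend = counters["frontend_stall_cycles"] / total
    backend = counters["backend_stall_cycles"] / total
    # Cycles that are not stalls are attributed to useful work.
    return {"work": 1.0 - frontend - backend,
            "frontend": frontend,
            "backend": backend}


def predicted_slowdown(a, b):
    """Step ii: placeholder interference model (an assumption).

    Applications that spend their cycles in the same category are
    predicted to slow each other down more when sharing a core.
    """
    return 1.0 + sum(a[k] * b[k] for k in a)


def pairings(apps):
    """Yield every way to split an even-sized list into unordered pairs."""
    if not apps:
        yield []
        return
    first, rest = apps[0], apps[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def best_pairing(stacks):
    """Choose the pairing with the lowest total predicted slowdown.

    Brute force over all pairings; only practical for a handful of apps.
    """
    def cost(pairing):
        return sum(predicted_slowdown(stacks[x], stacks[y])
                   + predicted_slowdown(stacks[y], stacks[x])
                   for x, y in pairing)
    return min(pairings(sorted(stacks)), key=cost)


# Made-up counter readings for four applications.
raw = {
    "A": {"cycles": 1000, "frontend_stall_cycles": 100, "backend_stall_cycles": 700},
    "B": {"cycles": 1000, "frontend_stall_cycles": 100, "backend_stall_cycles": 600},
    "C": {"cycles": 1000, "frontend_stall_cycles": 600, "backend_stall_cycles": 100},
    "D": {"cycles": 1000, "frontend_stall_cycles": 100, "backend_stall_cycles": 100},
}
stacks = {name: build_stack(c) for name, c in raw.items()}
print(best_pairing(stacks))
```

With these made-up readings, the two backend-bound applications (A and B) end up on different cores. The sketch also shows why the abstract stresses stack accuracy: any error in `build_stack` propagates directly into the costs that decide the pairing.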

Problem

Research questions and friction points this paper is trying to address.

Optimizing thread-to-core allocation in SMT ARM processors

Mitigating inter-application interference via performance prediction models

Improving accuracy of performance stacks for better allocation policies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the ISC stack to overcome ARM PMU limitations

Uses a performance prediction model to account for inter-application interference

SYNPA4 improves turnaround time by 38% over Linux
