🤖 AI Summary
Machine learning (ML) operator auto-tuning is severely constrained by the scarcity of target hardware, especially for emerging or embedded architectures.
Method: This paper proposes a cross-architecture simulation-prediction co-tuning framework built on fast instruction-accurate simulators (QEMU, Spike, gem5). It is the first to empirically validate that statistical features derived from simulation can faithfully predict real-hardware performance, using lightweight performance predictors that combine regression and learning-to-rank techniques.
Contribution/Results: The framework enables unified auto-tuning across heterogeneous platforms (x86, ARM, RISC-V) without access to physical devices. In experiments, the implementation that is truly fastest on the target hardware always falls within the top 3 % of predictions; for embedded architectures, running as few as three samples on three simulators in parallel already outperforms tuning natively on the target hardware. This substantially improves the scalability and parallel throughput of ML operator optimization.
📝 Abstract
Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally, autotuning requires the workloads to be executed on the target hardware (HW). We present an interface that allows executing autotuning workloads on simulators. This approach offers high scalability when the availability of the target HW is limited, as many simulations can be run in parallel on any accessible HW. Additionally, we evaluate the feasibility of using fast instruction-accurate simulators for autotuning. We train various predictors to forecast the performance of ML workload implementations on the target HW based on simulation statistics. Our results demonstrate that the tuned predictors are highly effective. The best workload implementation in terms of actual run time on the target HW is always within the top 3 % of predictions for the tested x86, ARM, and RISC-V-based architectures. In the best case, this approach outperforms native execution on the target HW for embedded architectures when running as few as three samples on three simulators in parallel.