Performance-Portable Optimization and Analysis of Multiple Right-Hand Sides in a Lattice QCD Solver

📅 2026-01-09

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the high computational cost and memory bandwidth bottlenecks of solving sparse linear systems with multiple right-hand sides (rhs) in lattice QCD. It extends the DD-αAMG solver to handle multiple rhs in both the Wilson-Dirac operator evaluation and the GMRES solver, with and without odd-even preconditioning. To help auto-vectorization, it introduces a flexible interface that supports various data layouts, along with a new layout for better SIMD utilization. The authors also explore implementations that use Arm's Scalable Matrix Extension (SME) and give an early assessment of its potential benefits. Experiments on x86 and Arm clusters show similar speedups, demonstrating performance portability, and the accompanying performance analysis reveals the complexity introduced by architectural constraints and compiler behavior.

📝 Abstract

Managing the high computational cost of iterative solvers for sparse linear systems is a known challenge in scientific computing. Moreover, scientific applications often face memory bandwidth constraints, making it critical to optimize data locality and enhance the efficiency of data transport. We extend the lattice QCD solver DD-$\alpha$AMG to incorporate multiple right-hand sides (rhs) for both the Wilson-Dirac operator evaluation and the GMRES solver, with and without odd-even preconditioning. To optimize auto-vectorization, we introduce a flexible interface that supports various data layouts and implement a new data layout for better SIMD utilization. We evaluate our optimizations on both x86 and Arm clusters, demonstrating performance portability with similar speedups. A key contribution of this work is the performance analysis of our optimizations, which reveals the complexity introduced by architectural constraints and compiler behavior. Additionally, we explore different implementations leveraging a new matrix instruction set for Arm called SME and provide an early assessment of its potential benefits.
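
To illustrate why batching right-hand sides can ease memory bandwidth pressure and help auto-vectorization, here is a minimal sketch of a sparse matrix-vector product applied to several rhs at once. It is not the DD-$\alpha$AMG implementation: the `csr_matrix` type and the `spmv_multi_rhs` function are hypothetical names, the matrix is a generic real-valued CSR matrix rather than the complex Wilson-Dirac operator, and the interleaved layout (rhs index running fastest) is an assumed example of a SIMD-friendly layout, not necessarily the one the paper introduces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR matrix type, used only for this illustration. */
typedef struct {
    size_t n;              /* number of rows */
    const size_t *row_ptr; /* n + 1 offsets into col_idx and val */
    const size_t *col_idx; /* column index of each nonzero */
    const double *val;     /* value of each nonzero */
} csr_matrix;

/* Computes y = A * x for nrhs right-hand sides stored interleaved:
 * x[j * nrhs + r] holds component j of right-hand side r (assumed layout).
 * Each matrix entry is loaded once and reused for all nrhs vectors. */
void spmv_multi_rhs(const csr_matrix *A, size_t nrhs,
                    const double *restrict x, double *restrict y)
{
    assert(nrhs > 0);
    for (size_t i = 0; i < A->n; ++i) {
        double *yi = y + i * nrhs;
        for (size_t r = 0; r < nrhs; ++r)
            yi[r] = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k) {
            const double a = A->val[k];
            const double *xj = x + A->col_idx[k] * nrhs;
            /* Contiguous, unit-stride loop over rhs: a natural
             * candidate for compiler auto-vectorization. */
            for (size_t r = 0; r < nrhs; ++r)
                yi[r] += a * xj[r];
        }
    }
}
```

With this layout the innermost loop touches `nrhs` consecutive values, so the cost of reading each matrix entry is amortized over all right-hand sides. Whether a compiler actually vectorizes the loop depends on the target architecture and compiler flags, which is the kind of behavior the abstract says the performance analysis examines.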

Problem

Research questions and friction points this paper is trying to address.

lattice QCD

multiple right-hand sides

performance portability

memory bandwidth

iterative solvers

Innovation

Methods, ideas, or system contributions that make the work stand out.

multiple right-hand sides

performance portability

SIMD vectorization