Performance-Portable Optimization and Analysis of Multiple Right-Hand Sides in a Lattice QCD Solver

📅 2026-01-09
🏛️ arXiv.org
🤖 AI Summary
This work addresses the high computational cost and memory-bandwidth bottlenecks of solving sparse linear systems with multiple right-hand sides (RHS) in lattice QCD. It extends the DD-αAMG solver so that both the Wilson-Dirac operator evaluation and the GMRES solver handle multiple RHS, with and without odd-even preconditioning, and adds a flexible interface for different data layouts together with a new layout designed for better SIMD utilization under auto-vectorization. The authors also explore implementations based on Arm's Scalable Matrix Extension (SME) and give an early assessment of its potential benefits. Experiments on x86 and Arm clusters show similar speedups on both, supporting the performance portability of the optimizations, and the accompanying performance analysis shows how architectural constraints and compiler behavior complicate the picture.

📝 Abstract
Managing the high computational cost of iterative solvers for sparse linear systems is a known challenge in scientific computing. Moreover, scientific applications often face memory bandwidth constraints, making it critical to optimize data locality and enhance the efficiency of data transport. We extend the lattice QCD solver DD-$\alpha$AMG to incorporate multiple right-hand sides (rhs) for both the Wilson-Dirac operator evaluation and the GMRES solver, with and without odd-even preconditioning. To optimize auto-vectorization, we introduce a flexible interface that supports various data layouts and implement a new data layout for better SIMD utilization. We evaluate our optimizations on both x86 and Arm clusters, demonstrating performance portability with similar speedups. A key contribution of this work is the performance analysis of our optimizations, which reveals the complexity introduced by architectural constraints and compiler behavior. Additionally, we explore different implementations leveraging a new matrix instruction set for Arm called SME and provide an early assessment of its potential benefits.
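
The role of the data layout for multiple right-hand sides can be illustrated with a small sketch. It is not taken from the paper: a 1D periodic nearest-neighbour stencil stands in for the Wilson-Dirac operator, all function names are hypothetical, and the RHS-fastest ordering shown here is only one common choice that may differ from the layout the authors implement. The point is the index arithmetic: when the RHS index runs fastest, the innermost loop reads consecutive entries and reuses the same stencil coefficients, which is the access pattern that a compiler for C code would typically be able to auto-vectorize.

```python
# Illustrative sketch only. The stencil, the names, and the layouts are
# assumptions made for this example, not the DD-alphaAMG implementation.


def apply_stencil_rhs_major(x, n_sites, n_rhs):
    """Layout A: each right-hand side is contiguous, x[r * n_sites + s]."""
    y = [0.0] * (n_sites * n_rhs)
    for r in range(n_rhs):
        base = r * n_sites
        for s in range(n_sites):
            left = x[base + (s - 1) % n_sites]   # periodic neighbour
            right = x[base + (s + 1) % n_sites]
            y[base + s] = 2.0 * x[base + s] - left - right
    return y


def apply_stencil_site_major(x, n_sites, n_rhs):
    """Layout B: the RHS index runs fastest, x[s * n_rhs + r]."""
    y = [0.0] * (n_sites * n_rhs)
    for s in range(n_sites):
        lo = ((s - 1) % n_sites) * n_rhs
        hi = ((s + 1) % n_sites) * n_rhs
        c = s * n_rhs
        # The innermost loop walks n_rhs consecutive entries and applies the
        # same coefficients to each, i.e. a unit-stride, SIMD-friendly loop.
        for r in range(n_rhs):
            y[c + r] = 2.0 * x[c + r] - x[lo + r] - x[hi + r]
    return y


def to_site_major(x, n_sites, n_rhs):
    """Reorder from layout A to layout B."""
    return [x[r * n_sites + s] for s in range(n_sites) for r in range(n_rhs)]


def to_rhs_major(x, n_sites, n_rhs):
    """Reorder from layout B back to layout A."""
    return [x[s * n_rhs + r] for r in range(n_rhs) for s in range(n_sites)]


if __name__ == "__main__":
    n_sites, n_rhs = 8, 4
    x_a = [float(i % 7) for i in range(n_sites * n_rhs)]
    y_a = apply_stencil_rhs_major(x_a, n_sites, n_rhs)
    x_b = to_site_major(x_a, n_sites, n_rhs)
    y_b = to_rhs_major(apply_stencil_site_major(x_b, n_sites, n_rhs), n_sites, n_rhs)
    # Both layouts compute the same operator; only the memory order differs.
    print(y_a == y_b)
```

Both functions apply the same operator, so any performance difference between such layouts in a compiled implementation comes from memory access and vectorization rather than from the arithmetic.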
Problem

Research questions and friction points this paper is trying to address.

lattice QCD
multiple right-hand sides
performance portability
memory bandwidth
iterative solvers
Innovation

Methods, ideas, or system contributions that make the work stand out.

multiple right-hand sides
performance portability
SIMD vectorization
data layout optimization
SME
Shiting Long
KTH Royal Institute of Technology, Stockholm, Sweden
Gustavo Ramirez-Hidalgo
Forschungszentrum Jülich GmbH, Jülich, Germany
Stepan Nassyr
Forschungszentrum Jülich GmbH, Jülich, Germany
Jose Jimenez-Merchan
University of Wuppertal, Wuppertal, Germany
Andreas Frommer
Professor of Applied Computer Science, Bergische Universität Wuppertal
Scientific Computing, High-Performance Computing, Numerical Linear Algebra
Dirk Pleiter
Rijksuniversiteit Groningen + KTH Royal Institute of Technology
Computer Science, Physics, HPC