Implementation of QR factorization of tall and very skinny matrices on current GPUs

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory-bandwidth bottleneck encountered when computing the QR decomposition of tall-and-skinny dense real matrices on GPUs. The authors systematically evaluate algorithms including Cholesky-QR2, SVQB, and Householder-based TSQR, and introduce two key optimizations: a Q-less QR strategy that avoids explicitly storing the orthogonal factor Q to reduce memory traffic, and the use of fast shared memory within a tree-based reduction scheme to accelerate local computations. Double-precision experiments on recent GPUs demonstrate that the optimized TSQR implementation significantly outperforms Gramian-based methods in the memory-bound to transitional (memory/compute-bound) regime and remains competitive with a vendor-provided library routine, underscoring the need for specialized methods and implementations for QR decomposition of tall-and-skinny matrices.

📝 Abstract
We consider the problem of computing a QR (or QZ) decomposition of a real, dense, tall and very skinny matrix. That is, the number of columns is tiny compared to the number of rows, rendering most computations completely or partially memory-bandwidth limited. The paper focuses on recent NVIDIA GPGPUs still supporting 64-bit floating-point arithmetic, but the findings carry over to AMD GPUs as well. We discuss two basic algorithms: Methods based on the normal equations (Gram matrix), in particular Cholesky-QR2 and SVQB, and the "tall-skinny QR" (TSQR), based on Householder transformations in a tree-reduction scheme. We propose two primary optimization techniques: Avoiding the write-back of the Q factor ("Q-less QR"), and exploiting fast local memory (shared memory on GPUs). We compare a straight-forward implementation of Gramian-based methods, and a more sophisticated TSQR implementation, in terms of performance achieved, time-to-solution, and implementation complexity. By performance modelling and numerical experiments with our own code and a vendor-optimized library routine, we demonstrate the crucial need for specialized methods and implementations in this memory-bound to transitional (memory/compute-bound) regime, and that TSQR is competitive in terms of time-to-solution, but at the cost of an investment in low-level code optimization.
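The Gramian-based approach discussed in the abstract can be made concrete with a short sketch. The following NumPy code is a minimal illustration of Cholesky-QR2 (not the authors' GPU implementation): one pass forms the small n×n Gram matrix AᵀA, Cholesky-factors it as RᵀR, and recovers Q = AR⁻¹; a second pass repairs the orthogonality lost to the squared condition number of the normal equations. The function name and block structure are ours, chosen for illustration.

```python
import numpy as np

def cholesky_qr2(A):
    """Illustrative CholeskyQR2 for a tall-skinny m x n matrix A (m >> n).

    Each pass touches A only through the small Gram matrix G = A^T A,
    which is why Gramian-based methods are attractive in the
    memory-bandwidth-limited regime the paper targets.
    """
    def cholesky_qr(X):
        G = X.T @ X                       # n x n Gram matrix
        R = np.linalg.cholesky(G).T       # upper-triangular Cholesky factor
        Q = np.linalg.solve(R.T, X.T).T   # Q = X R^{-1} via triangular solve
        return Q, R

    Q1, R1 = cholesky_qr(A)   # first pass: Q1 may lose orthogonality
    Q, R2 = cholesky_qr(Q1)   # second pass restores orthogonality
    return Q, R2 @ R1         # product of upper triangulars is upper triangular
```

A single Cholesky-QR pass is accurate only for well-conditioned A; the second pass (the "2" in CholeskyQR2) makes Q orthogonal to machine precision for moderately ill-conditioned inputs.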
Problem

Research questions and friction points this paper is trying to address.

QR factorization
tall-and-skinny matrices
GPU computing
memory-bandwidth limited
dense matrix decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

TSQR
Q-less QR
memory-bandwidth optimization
GPU shared memory
tall-skinny QR
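The TSQR and Q-less QR ideas listed above can be sketched as follows. This is an assumption-laden host-side illustration in NumPy, not the authors' GPU kernel: the matrix is split into row blocks, each block gets an independent Householder QR, and the resulting small R factors are combined pairwise in a reduction tree. Computing only R ("Q-less") avoids writing the large Q factor back to memory entirely; the function name and block size are ours.

```python
import numpy as np

def tsqr_r(A, block_rows=256):
    """Q-less TSQR sketch: compute only the R factor of a tall-skinny
    matrix via a tree reduction over row blocks, never materializing Q."""
    m, n = A.shape
    # Leaf level: independent Householder QRs of each row block.
    rs = [np.linalg.qr(A[i:i + block_rows], mode='r')
          for i in range(0, m, block_rows)]
    # Reduction tree: pairwise stack and re-factor until one R remains.
    while len(rs) > 1:
        nxt = [np.linalg.qr(np.vstack((rs[j], rs[j + 1])), mode='r')
               for j in range(0, len(rs) - 1, 2)]
        if len(rs) % 2:          # odd count: carry the last R up a level
            nxt.append(rs[-1])
        rs = nxt
    return rs[0]
```

Note that the R produced this way may differ from a one-shot QR of A by row signs (a standard ambiguity of the QR factorization); it satisfies RᵀR = AᵀA either way. On a GPU, each leaf and tree node would be computed by a thread block in fast shared memory, which is the optimization the paper pursues.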
Jonas Thies
Assistant Professor, TU Delft
numerical mathematics, high performance computing
Melven Röhrig-Zöllner
German Aerospace Center, Institute of Software Technology