🤖 AI Summary
This work addresses the memory-bandwidth bottleneck encountered when computing the QR decomposition of tall-and-skinny dense real matrices on GPUs. The authors systematically evaluate Gramian-based algorithms (Cholesky-QR2 and SVQB) and the Householder-based TSQR, which factors row blocks in a tree-reduction scheme, and introduce two key optimizations: a Q-less QR strategy that avoids writing back the orthogonal factor Q, and the use of fast local (shared) memory to accelerate the block-local computations. Double-precision experiments on NVIDIA GPUs, backed by performance modeling, show that specialized implementations are essential in this memory-bound to transitional regime, and that the optimized TSQR is competitive with a vendor-optimized library routine in time-to-solution, at the cost of substantial low-level code tuning.
📝 Abstract
We consider the problem of computing a QR (or "QB") decomposition of a real, dense, tall and very skinny matrix, i.e., one whose number of columns is tiny compared to its number of rows, so that most computations are completely or partially limited by memory bandwidth. The paper focuses on recent NVIDIA GPGPUs that still support 64-bit floating-point arithmetic, but the findings carry over to AMD GPUs as well. We discuss two basic algorithmic families: methods based on the normal equations (Gram matrix), in particular Cholesky-QR2 and SVQB, and the "tall-skinny QR" (TSQR), which applies Householder transformations in a tree-reduction scheme. We propose two primary optimization techniques: avoiding the write-back of the Q factor ("Q-less QR"), and exploiting fast local memory (shared memory on GPUs). We compare a straightforward implementation of the Gramian-based methods with a more sophisticated TSQR implementation in terms of achieved performance, time-to-solution, and implementation complexity. Through performance modeling and numerical experiments with our own code and a vendor-optimized library routine, we demonstrate the crucial need for specialized methods and implementations in this memory-bound to transitional (memory-/compute-bound) regime, and show that TSQR is competitive in time-to-solution, but at the cost of an investment in low-level code optimization.
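To make the two algorithm families and the Q-less idea concrete, here is a minimal NumPy sketch (not the paper's GPU code; function names, blocking, and all details are our own assumptions): CholeskyQR2 works on the small Gram matrix and applies the CholeskyQR step twice to restore orthogonality, while the Q-less TSQR reduces per-block R factors through a tree without ever materializing Q.

```python
# Hypothetical CPU sketch of the algorithm families discussed above;
# function names and block size are illustrative, not from the paper.
import numpy as np

def cholesky_qr(A):
    """One CholeskyQR step: form the small m x m Gram matrix, factor it,
    and recover Q via a triangular solve so that A = Q R."""
    G = A.T @ A                      # Gram matrix (normal equations)
    R = np.linalg.cholesky(G).T      # upper-triangular factor of G
    Q = np.linalg.solve(R.T, A.T).T  # Q = A R^{-1}
    return Q, R

def cholesky_qr2(A):
    """CholeskyQR2: repeat the step once to recover the orthogonality
    lost to the squared condition number of the Gram matrix."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1                # A = Q (R2 R1)

def tsqr_r(A, block_rows=256):
    """Q-less TSQR sketch: factor row blocks independently, stack the
    small R factors, and reduce recursively; Q is never written back."""
    m = A.shape[1]
    Rs = [np.linalg.qr(A[i:i + block_rows], mode="r")
          for i in range(0, A.shape[0], block_rows)]
    R = np.vstack(Rs)
    return tsqr_r(R, block_rows) if R.shape[0] > m else R
```

On a GPU, each `tsqr_r` leaf would be a Householder QR of one row block held in shared memory, and only the m x m R factors would travel through the reduction tree, which is what makes the Q-less variant attractive in the bandwidth-limited regime.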