🤖 AI Summary
This work addresses the poor parallelism and low hardware utilization of chunk-wise parallel linear attention on NPUs, which stem from the forward substitution used for matrix inversion. To overcome this, the authors propose a fast approximation algorithm built entirely on matrix multiplication. Designed specifically for strictly lower-triangular matrices, the method combines a truncated Neumann series expansion, structured masking, and parallel residual correction to eliminate sequential dependencies. It is further made compatible with low-bit quantization and adapts the approximation order to the chunk size, achieving high hardware efficiency without compromising model accuracy. Evaluated on the Qwen3.5 model family, the approach delivers up to 5× kernel-level speedup, reduces decoder-layer overhead by 20%, and preserves accuracy in both floating-point and low-precision inference settings.
📝 Abstract
Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast algorithm based on matrix multiplication (MatMul), tailored to the strictly lower-triangular matrices that arise in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit integer (INT) formats by mitigating the dynamic-range expansion caused by repeated matrix powers, and we adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
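To make the core idea concrete, the following is a minimal NumPy sketch of a MatMul-only inverse for a matrix of the form I - A with A strictly lower triangular. It is an illustration under stated assumptions, not the authors' kernel: the function name `neumann_inverse`, the Horner-style accumulation, the specific residual update X ← X + X(I - (I - A)X), and the plain lower-triangular mask are all choices made here for the example. The abstract does not specify these details, nor the sign convention of the matrix being inverted.

```python
import numpy as np

def neumann_inverse(A, order, residual_steps=1):
    """Approximate (I - A)^{-1} for strictly lower-triangular A using only matmuls.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = A.shape[-1]
    I = np.eye(n, dtype=A.dtype)
    # Assumed form of "structural masking": keep only the lower triangle.
    # In exact arithmetic this is a no-op, since every term is lower triangular.
    mask = np.tril(np.ones((n, n), dtype=A.dtype))

    # Truncated Neumann series in Horner form: X = I + A + A^2 + ... + A^order.
    X = I.copy()
    for _ in range(order):
        X = I + A @ X

    # Residual correction: with R = I - (I - A) X, the update X + X R
    # extends a series truncated at order p to order 2p + 1.
    M = I - A
    for _ in range(residual_steps):
        R = I - M @ X
        X = (X + X @ R) * mask
    return X

# Usage: a 16 x 16 chunk. Because A^16 = 0, order 3 followed by two residual
# steps (3 -> 7 -> 15) reproduces the exact inverse up to rounding.
rng = np.random.default_rng(0)
n = 16
A = np.tril(rng.standard_normal((n, n)) * 0.05, k=-1)
X = neumann_inverse(A, order=3, residual_steps=2)
exact = np.linalg.inv(np.eye(n) - A)
print(np.max(np.abs(X - exact)))
```

Because a strictly lower-triangular n × n matrix is nilpotent (A^n = 0), the Neumann series is finite, so truncation trades a bounded error for fewer matrix products. Each residual step costs two matmuls and roughly doubles the effective order, which is one plausible reason to tune the order and the number of residual steps jointly against the chunk size.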