When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

📅 2026-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited parallelism and poor hardware utilization of chunk-wise parallel linear attention on NPUs, which stem from the forward substitution used for matrix inversion. The authors propose a fast approximation algorithm built entirely from matrix multiplications. Designed for the strictly lower-triangular matrices that arise in chunk-wise linear attention, the method combines a truncated Neumann series expansion, structural masking, and parallel residual correction to eliminate sequential dependencies. It extends to low-bit integer inference by mitigating the dynamic-range expansion caused by repeated matrix powers, and it adapts the approximation order and residual step to the chunk size to minimize computational cost while preserving model accuracy. Evaluated on Qwen3.5-family models, the approach delivers up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference.
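As a worked illustration of why a truncated Neumann series suits this setting, assume (this form is an assumption, not stated on this page) that the matrix to invert is $M = I + A$, with $A$ a strictly lower-triangular $C \times C$ matrix and $C$ the chunk size. Such an $A$ is nilpotent, so the series is finite, and truncating at order $K$ leaves a residual that is a single matrix power:

$$
A^{C} = 0
\quad\Longrightarrow\quad
(I + A)^{-1} = \sum_{k=0}^{C-1} (-A)^{k},
$$

$$
X_K = \sum_{k=0}^{K} (-A)^{k}
\quad\Longrightarrow\quad
I - (I + A)\,X_K = (-A)^{K+1}.
$$

Every term is a matrix product, so no forward substitution is needed. The residual $(-A)^{K+1}$ is nonzero only at least $K+1$ positions below the diagonal, which is what a subsequent correction step has to remove.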
📝 Abstract
Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for the strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bit integer (INT) inference by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5$\times$ kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
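The general idea of a multiplication-only inverse can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' kernel: the matrix is assumed to have the form `I + A` with `A` strictly lower triangular, the residual correction is a standard Newton-Schulz step (swapped in here because the page does not specify the paper's correction), and the mask simply re-imposes lower-triangularity. The function names `truncated_neumann_inverse`, `residual_correction`, and `demo` are hypothetical.

```python
import numpy as np


def truncated_neumann_inverse(A, order):
    """Approximate (I + A)^{-1} by sum_{k=0}^{order} (-A)^k using only matmuls."""
    n = A.shape[0]
    X = np.eye(n)
    term = np.eye(n)
    for _ in range(order):
        term = -(term @ A)  # next series term: (-A)^k
        X = X + term
    return X


def residual_correction(A, X, steps=1):
    """Newton-Schulz refinement X <- X (2I - (I + A) X); each step squares the residual.

    Assumption: this stands in for the paper's residual correction step.
    """
    n = A.shape[0]
    I = np.eye(n)
    M = I + A
    for _ in range(steps):
        X = X @ (2.0 * I - M @ X)
        X = np.tril(X)  # illustrative mask: the true inverse is lower triangular
    return X


def demo(chunk=8, order=3, seed=0):
    """Return max abs error before and after one correction step."""
    rng = np.random.default_rng(seed)
    A = np.tril(rng.uniform(-0.5, 0.5, size=(chunk, chunk)), k=-1)
    exact = np.linalg.inv(np.eye(chunk) + A)
    X0 = truncated_neumann_inverse(A, order)
    X1 = residual_correction(A, X0, steps=1)
    return np.abs(X0 - exact).max(), np.abs(X1 - exact).max()
```

Under these assumptions, an order-`K` truncation leaves the residual `(-A)^(K+1)`, and one Newton-Schulz step squares it to `A^(2K+2)`, which vanishes once `2(K+1)` reaches the chunk size. That is one way to see why the approximation order and the number of correction steps can be traded against the chunk size; the settings the paper actually uses are not given on this page.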
Problem

Research questions and friction points this paper is trying to address.

matrix inversion
linear attention
long-context modeling
hardware utilization
NPU
Innovation

Methods, ideas, or system contributions that make the work stand out.

matrix inversion approximation
linear attention
Neumann series
quantization
hardware-efficient
Luoming Zhang
Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc.
Yuwei Ren
Qualcomm
wireless communication, machine learning, signal processing
Kui Zhang
Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc.
Tian Liu
Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc.
Lingjuan Ge
Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc.
Denghao Li
Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc.
Matthew Harper Langston
Qualcomm AI Research, an initiative of Qualcomm Technologies, Inc.
Yin Huang
Research Assistant, University of Florida
Multi-Armed Bandits, Edge Computing, Wireless Communications, Quantum Networking
Weiliang Will Zeng
Qualcomm AI Research; Tsinghua University
GenAI, Deep Learning, Optimization, Signal Processing, Information Theory
Liang Zhang
Google
Computer systems, Networking, Social networks