🤖 AI Summary
This work addresses the accuracy degradation and efficiency bottlenecks common in low-bit post-training quantization, which often stem from neglecting inter-layer dependencies. The authors propose an attention-aware quantization method that operates without backpropagation, jointly quantizing multiple output channels of attention modules. A closed-form error compensation rule, combined with a mechanism for correcting errors propagated from preceding layers, effectively suppresses outliers and improves model accuracy. In addition, adaptive grid optimization via coordinate descent substantially accelerates the quantization process. The method achieves state-of-the-art accuracy in both weight-only and weight-activation quantization settings, while offering a more than threefold speedup over the BoA baseline.
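The closed-form error compensation mentioned above builds on the GPTQ/OBQ family of methods, where each quantized channel's rounding error is folded into the not-yet-quantized channels using the inverse Hessian of the layer inputs. The sketch below illustrates that baseline mechanism only; it is not TurboBoA's joint multi-channel rule, and the function name `gptq_style_quantize` is a hypothetical helper for illustration.

```python
import numpy as np

def gptq_style_quantize(W, X, bits=4, damp=0.01):
    """Sequentially quantize columns of W (out_dim x in_dim), compensating
    each column's rounding error on the remaining columns via the inverse
    Hessian of the calibration inputs X (in_dim x n_samples).
    Illustrative OBQ/GPTQ-style baseline, fixed left-to-right order."""
    W = W.astype(np.float64).copy()
    in_dim = W.shape[1]

    # Hessian of the layer reconstruction objective, with damping for stability.
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(in_dim)
    Hinv = np.linalg.inv(H)

    # Per-row symmetric uniform grid for b-bit quantization.
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.max(np.abs(W), axis=1) / qmax, 1e-12)

    def quantize_col(col):
        return np.clip(np.round(col / scale), -qmax - 1, qmax) * scale

    Q = np.zeros_like(W)
    for i in range(in_dim):
        Q[:, i] = quantize_col(W[:, i])
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Closed-form compensation: spread the error over remaining columns.
        W[:, i:] -= np.outer(err, Hinv[i, i:])
        # Remove column i from the inverse Hessian (Gaussian elimination step).
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]
    return Q
```

TurboBoA's contribution is precisely to avoid this one-channel-at-a-time loop by quantizing groups of out-channels jointly, which is where its speedup over BoA comes from.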
📝 Abstract
The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ's assumption of layer-wise independence leads to severe accuracy drops in low-bit regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces sequential bottlenecks and yields more than a three-fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization. The code will be available at https://github.com/SamsungLabs/TurboBoA.
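The abstract's "adaptive grid computation with coordinate descent refinement" can be pictured with a toy example: refining an asymmetric quantization grid (scale and zero-point) for a single channel by alternating closed-form coordinate updates. This is only a sketch of the general idea under simple assumptions (per-channel uniform asymmetric quantization, MSE objective); TurboBoA's actual objective and update rules are specified in the paper, and `refine_grid` is a hypothetical name.

```python
import numpy as np

def refine_grid(w, bits=4, iters=10):
    """Refine an asymmetric quantization grid (scale s, zero-point z) for
    one weight channel w by coordinate descent: update s with z fixed,
    then z with s fixed, re-quantizing between sweeps. Each step is a
    closed-form least-squares minimizer, so the MSE never increases."""
    levels = 2 ** bits - 1
    s = (w.max() - w.min()) / levels          # min-max initialization
    z = w.min()
    for _ in range(iters):
        # Integer codes under the current grid.
        q = np.clip(np.round((w - z) / s), 0, levels)
        # Coordinate 1: least-squares optimal scale with z fixed.
        denom = np.sum(q * q)
        if denom > 0:
            s = max(np.sum(q * (w - z)) / denom, 1e-12)
        # Coordinate 2: optimal zero-point with s fixed.
        z = np.mean(w - s * q)
    return s, z
```

Because every update (rounding, scale, zero-point) minimizes the reconstruction error in its own coordinate, the refined grid is never worse than the min-max initialization.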