🤖 AI Summary
This work addresses the challenges of low-bit post-training quantization (PTQ) for diffusion large language models (DLLMs), specifically the disparity between the activation distributions of masked and unmasked tokens within a denoising step and the accumulation of quantization error across steps during iterative decoding. To mitigate these issues, the authors propose STaR-Quant, a state-time consistent PTQ framework that combines State-Guided Activation Transformation (SGAT) with a lightweight Temporal Attention Compensation (TAC) mechanism. SGAT assigns masked and unmasked tokens to different activation transformation spaces while keeping a unified static weight-side transformation; TAC corrects the quantized attention representation through a block-diagonal affine mapping. Experiments on representative DLLMs show that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, with up to 1.69× speedup and 3.14× memory savings over FP16 deployment.
📝 Abstract
Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization (PTQ) for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69× speedup and 3.14× memory savings over FP16 deployment.
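The abstract describes TAC as a block-diagonal affine mapping but does not give its form. The sketch below only illustrates what such a mapping generally looks like, y = blockdiag(A_1, ..., A_k) x + b, and is not the authors' implementation. The function name `block_diag_affine`, the block sizes, and the example values are all hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def block_diag_affine(x: Vector, blocks: List[Matrix], bias: Vector) -> Vector:
    """Compute y = blockdiag(blocks) @ x + bias without building the full matrix.

    Hypothetical helper for illustration only; how TAC fits or applies its
    mapping is not specified in the abstract.
    """
    out: Vector = []
    offset = 0
    for block in blocks:
        size = len(block)
        chunk = x[offset:offset + size]
        # Each block must be square and must cover its own slice of the input.
        if len(chunk) != size or any(len(row) != size for row in block):
            raise ValueError("blocks must be square and tile the input exactly")
        for row in block:
            out.append(sum(w * v for w, v in zip(row, chunk)))
        offset += size
    if offset != len(x) or len(bias) != len(x):
        raise ValueError("blocks and bias must match the input length")
    return [o + b for o, b in zip(out, bias)]


# Toy example: a 4-dimensional vector corrected by two 2x2 blocks plus a bias.
x = [1.0, 2.0, 3.0, 4.0]
blocks = [
    [[1.0, 0.0], [0.0, 1.0]],   # identity block: leaves the first half unchanged
    [[2.0, 0.0], [0.0, 0.5]],   # rescales the second half
]
bias = [0.0, 0.0, 1.0, -1.0]
corrected = block_diag_affine(x, blocks, bias)
print(corrected)  # [1.0, 2.0, 7.0, 1.0]
```

A dense d×d correction matrix carries d² parameters, whereas a block-diagonal one with block size b carries only d·b. That parameter saving is one plausible reading of why the abstract calls the mapping "lightweight".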