🤖 AI Summary
This work addresses the performance degradation of ultra-low-bit post-training quantization that arises when Hessian curvature is estimated from scarce calibration data and is therefore noisy. The authors propose DASH-Q, a framework that combines a diagonal Hessian approximation with iterative weighted least squares to suppress sampling noise under very limited calibration samples while preserving salient feature power. The key idea is to discard cross-channel dependencies, which are highly susceptible to noise, and to rely only on diagonal curvature information during quantization. Experiments show that DASH-Q improves zero-shot accuracy by 7.01% on average and by up to 14.01% over the strongest baselines across five large language models, and that it remains stable when calibration data is severely limited.
📝 Abstract
Large Language Models (LLMs) are widely used across many domains, but their scale makes deployment challenging. Post-Training Quantization (PTQ) reduces memory footprint without retraining by leveraging a small calibration set. Recent Hessian-based PTQ methods compensate for quantization error via cross-channel dependencies, but such approaches degrade at low bit-widths because curvature estimates from limited calibration data are noisy. We propose DASH-Q, a robust PTQ framework that uses a diagonal Hessian approximation and iterative weighted least squares. By discarding noise-prone dependencies, DASH-Q filters sampling noise while prioritizing the preservation of salient feature power. DASH-Q outperforms other PTQ baselines in the ultra-low-bit regime, improving zero-shot accuracy by 7.01% on average and by up to 14.01% over the strongest baselines across five LLMs, while remaining robust and stable with very small calibration sets.
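The abstract names two ingredients, a diagonal Hessian approximation and iterative weighted least squares, without giving the algorithm. The sketch below is only a generic illustration of how such pieces can fit together, not the DASH-Q method itself. It assumes the usual layer-wise proxy Hessian proportional to XᵀX computed from calibration activations, keeps only its diagonal, and alternates between rounding and a closed-form weighted least-squares update of a single per-row scale. The function names `diag_hessian` and `quantize_row`, the damping term, and the symmetric integer grid are all hypothetical choices made for this example.

```python
import numpy as np

def diag_hessian(X, damp=1e-2):
    """Diagonal of the proxy Hessian X^T X / n (assumed form, not from the paper).

    X has shape (n_samples, in_features). A small damping term (an assumption)
    keeps every entry strictly positive.
    """
    d = np.mean(X * X, axis=0)
    return d + damp * d.mean() + 1e-12

def quantize_row(w, h, bits=2, iters=10):
    """Quantize one weight row by minimizing sum_j h_j * (w_j - s * q_j)^2.

    Alternates between (1) rounding to the integer grid for a fixed scale s and
    (2) the closed-form weighted least-squares scale for fixed integers q.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Start from a max-abs scale (plain round-to-nearest when iters == 0).
    s = max(np.max(np.abs(w)) / max(qmax, 1), 1e-12)
    q = np.clip(np.round(w / s), qmin, qmax)
    for _ in range(iters):
        denom = np.sum(h * q * q)
        if denom == 0:
            break  # every weight rounded to zero; nothing to refit
        s_new = np.sum(h * w * q) / denom
        if s_new <= 0:
            break  # defensive guard; keep the last valid scale
        s = s_new
        q = np.clip(np.round(w / s), qmin, qmax)
    return s * q, s

# Usage with synthetic data: 16 calibration samples, 8 input channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
w = rng.normal(size=8)
h = diag_hessian(X)
w_hat, scale = quantize_row(w, h, bits=2)
```

Because both steps minimize the same weighted objective, the weighted error cannot increase from one iteration to the next. Whether DASH-Q uses per-row scales, a different grid, or a different weighting scheme is not stated in the abstract.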