🤖 AI Summary
This work asks whether compression techniques such as quantization and pruning preserve the uncertainty quantification capabilities of large language models (LLMs), a key requirement in safety-critical applications. Introducing uncertainty fidelity as an evaluation criterion, the study builds a unified benchmark based on conformal prediction and assesses 12 models under various compression configurations on five NLP tasks. The findings show that compression often decouples predictive accuracy from uncertainty reliability, that larger models are more robust to compression than smaller ones, and that uncertainty degradation is often threshold-like rather than gradual. These results suggest that evaluating compressed models on accuracy alone is insufficient and that uncertainty-aware testing should be part of standard evaluation pipelines for safe deployment.
📝 Abstract
Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model's ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.
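To make the conformal-prediction measure concrete, here is a minimal sketch of split conformal prediction for a classification-style task, where the average prediction-set size serves as the uncertainty signal. The abstract does not state which nonconformity score or coverage level the benchmark uses, so the score below (one minus the probability of the true label) and the helper names `conformal_threshold`, `prediction_set`, and `mean_set_size` are illustrative assumptions, not the authors' implementation.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold for target coverage 1 - alpha (split conformal)."""
    # Assumed nonconformity score: 1 minus the probability of the true label.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the calibration score to use.
    k = math.ceil((n + 1) * (1 - alpha))
    # If the rank exceeds n, no finite threshold works: include every label.
    return scores[k - 1] if k <= n else float("inf")


def prediction_set(probs, qhat):
    """Indices of all labels whose score does not exceed the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]


def mean_set_size(test_probs, qhat):
    """Average prediction-set size; larger sets indicate more uncertainty."""
    return sum(len(prediction_set(p, qhat)) for p in test_probs) / len(test_probs)
```

Under this sketch, one would calibrate and evaluate the original and the compressed model separately at the same `alpha` and compare their mean set sizes. A compressed model whose accuracy is unchanged but whose sets are noticeably larger would illustrate the decoupling of accuracy and uncertainty that the abstract describes.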