Are Generative Models Underconfident? An Embarrassingly Simple Quality Estimation Approach

📅 2025-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative models often exhibit “underconfidence” in multi-solution tasks: probability mass is dispersed across semantically equivalent outputs, so reference-free quality estimation (QE) based on raw sequence probability severely underestimates true quality. To address this, we propose Dominant Mass Probability (DMP), a training-free, fine-tuning-free QE metric with zero added overhead that statistically reconstructs and recalibrates model confidence over equivalent correct outputs. DMP operates solely on native token-level probabilities, introducing no additional parameters, modules, or architectural modifications, and is compatible with diverse autoregressive models (e.g., Whisper, Llama). Evaluated across translation, summarization, and other generation tasks, DMP improves average Pearson correlation with human ratings by 0.208 over the sequence-probability baseline, significantly enhancing alignment with human judgment. DMP establishes a simple, general, and robust paradigm for reference-free QE.

📝 Abstract
Quality Estimation (QE) is the task of estimating the quality of model output when a ground-truth reference is not available. Reading model uncertainty off its own output probabilities is the most trivial, low-effort way to estimate output quality. However, for generative models, output probabilities might not be the best quality estimator. At an output step there can be multiple correct options, spreading the probability distribution out. Thus, a lower token probability does not necessarily mean lower output quality; in other words, the model can be considered underconfident. In this paper, we propose a QE approach called Dominant Mass Probability (DMP) that boosts the model's confidence in cases where there are multiple viable output options. We show that, with no increase in complexity, DMP is notably better than sequence probability at estimating the quality of different models (Whisper, Llama, etc.) on different tasks (translation, summarization, etc.). Compared to sequence probability, DMP achieves on average a +0.208 improvement in Pearson correlation to ground-truth quality.
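The exact DMP formulation is not spelled out in this summary, but the core idea (credit the model for the whole cluster of viable next tokens rather than the single chosen one) can be sketched from per-step token probabilities. In the illustrative sketch below, the "dominant" set is assumed to be the top-ranked tokens covering a cumulative `mass` of the distribution; the threshold value, function names, and toy distribution are all assumptions for illustration, not the paper's definitions:

```python
import math

def sequence_logprob(step_probs, chosen):
    """Baseline QE: sum of log-probabilities of the chosen tokens."""
    return sum(math.log(p[t]) for p, t in zip(step_probs, chosen))

def dmp_logprob(step_probs, chosen, mass=0.9):
    """Illustrative DMP-style score (assumed formulation): at each step,
    if the chosen token lies inside the dominant probability mass (the
    top-ranked tokens covering `mass` of the distribution), credit the
    whole dominant mass instead of the single token's probability."""
    score = 0.0
    for probs, tok in zip(step_probs, chosen):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        cum, dominant = 0.0, set()
        for i in ranked:
            dominant.add(i)
            cum += probs[i]
            if cum >= mass:
                break
        if tok in dominant:
            score += math.log(cum)          # boosted confidence
        else:
            score += math.log(probs[tok])   # fall back to the raw probability
    return score

# Toy step: two near-synonymous correct continuations split the mass,
# so raw probability looks underconfident while the output is fine.
steps = [[0.45, 0.45, 0.10]]
chosen = [0]
print(sequence_logprob(steps, chosen))  # log(0.45) ≈ -0.80
print(dmp_logprob(steps, chosen))       # log(0.90) ≈ -0.11
```

The toy step shows the underconfidence effect described in the abstract: the chosen token's raw probability is only 0.45 even though 0.90 of the mass sits on acceptable continuations, and the DMP-style score recovers that higher confidence.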
Problem

Research questions and friction points this paper is trying to address.

Evaluates generative models' confidence levels.
Proposes Dominant Mass Probability for quality estimation.
Improves correlation with ground-truth quality metrics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dominant Mass Probability introduced
Boosts model confidence effectively
Improves Pearson correlation significantly