AI Summary
This work addresses the drop in error prediction performance that large language models show when their inputs are ambiguous. It proposes to model input ambiguity explicitly and to integrate it into the error prediction pipeline through Gated Experts and Selective Prediction. The method uses both gold-standard and model-predicted ambiguity labels and is evaluated with six uncertainty metrics. Across model families, datasets, and training and evaluation paradigms, ambiguity information improves error prediction; on standard question-answering datasets, individual uncertainty metrics gain more than 10 PRR points. These results indicate that explicitly modeling input ambiguity is an effective way to improve error prediction in large language models.
Abstract
The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack the knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs) by disentangling input ambiguity from the UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.
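
To make the reported metric concrete, the following is a minimal sketch of how a PRR score can be computed, assuming that PRR here denotes the Prediction Rejection Ratio commonly used in the UQ literature. The function names `rejection_auc` and `prr`, the binary correctness labels, and the toy data are illustrative assumptions, not the paper's implementation. The sketch rejects instances from most to least uncertain, tracks the mean quality of the retained answers, and normalizes the resulting area between a random and an oracle rejection order.

```python
def rejection_auc(quality, reject_order):
    """Area under the rejection curve.

    After rejecting the first k instances of `reject_order` (k = 0 .. n-1),
    record the mean quality of the retained instances, then average over k.
    """
    n = len(quality)
    ordered = [quality[i] for i in reject_order]
    total = sum(ordered)
    area = 0.0
    for k in range(n):
        area += total / (n - k)  # mean quality of the n - k retained instances
        total -= ordered[k]      # reject the next instance
    return area / n


def prr(uncertainty, quality):
    """Prediction Rejection Ratio (one common formulation, assumed here).

    Values near 1 mean the uncertainty ranks errors almost as well as an
    oracle; values near 0 mean it is no better than random rejection.
    """
    n = len(quality)
    indices = list(range(n))
    by_uncertainty = sorted(indices, key=lambda i: -uncertainty[i])
    by_oracle = sorted(indices, key=lambda i: quality[i])  # worst answers first

    auc_unc = rejection_auc(quality, by_uncertainty)
    auc_oracle = rejection_auc(quality, by_oracle)
    # Under random rejection the expected retained quality is the overall mean.
    auc_random = sum(quality) / n

    denom = auc_oracle - auc_random
    if denom == 0:
        return 0.0  # all answers equally good: the ratio is undefined
    return (auc_unc - auc_random) / denom


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = correct answer, 0 = incorrect answer.
    correct = [0, 1, 0, 1]
    good_scores = [0.9, 0.1, 0.8, 0.2]  # high uncertainty on the wrong answers
    bad_scores = [0.1, 0.9, 0.2, 0.8]   # high uncertainty on the right answers
    print(prr(good_scores, correct))
    print(prr(bad_scores, correct))
```

Under this formulation, an improvement of 10 PRR points corresponds to a gain of 0.10 on the normalized scale. Computing `prr` separately on ambiguous and unambiguous subsets would be one way to probe the abstract's claim that UQ metrics are more predictive of errors on unambiguous instances.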