🤖 AI Summary
This work addresses high error rates among non-rejected samples in selective classification for in-context binary classification with large language models, a problem that arises when the model's confidence estimate is inconsistent with its predictions. To mitigate this, the paper introduces pairwise queries into selective classification: additional pairwise queries are issued to the same model to detect high-error samples, and the results are incorporated into the rejection decision. Theoretical analysis establishes the conditions under which a simple pairwise-query algorithm outperforms an inconsistent confidence estimate. Experiments on one synthetic and four real-world in-context binary classification datasets show that the proposed method achieves a better accuracy-cost trade-off than raw confidence estimates alone, such as the LLM's next-token logits.
📝 Abstract
In selective classification, a model predicts labels for the samples on which it is confident and abstains on the samples on which it is not. The rejected samples are often labeled by an expert, which is expensive. The expert budget is best utilized when the model has low error on the non-rejected samples. However, the estimate of a model's confidence can be inconsistent with the model's predictions, which can lead to high error on non-rejected samples. Such situations can readily occur in in-context binary classification by LLMs. To remedy this, we propose making additional pairwise queries to the same model. These pairwise queries can detect high-error samples and can be incorporated into selective classification techniques to reduce the error on non-rejected samples. Theoretically, we establish the conditions under which a simple algorithm using pairwise queries outperforms an inconsistent confidence estimate. We support this insight through extensive experiments on one synthetic and four real in-context-learning-based binary classification datasets. In all these cases, we show that our algorithms, using pairwise queries, obtain a better accuracy-cost tradeoff than using only the raw confidence estimates, such as the LLM's next-token logits.
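To make the setting concrete, the following is a minimal sketch of selective classification under a fixed expert budget. It is not the paper's algorithm. The toy data, the function names, and in particular the `flagged` list (standing in for whatever signal the pairwise queries produce) are assumptions made for illustration. The sketch compares rejecting the lowest-confidence samples against rejecting flagged samples first, and reports the error on the non-rejected samples in each case.

```python
from typing import List, Sequence


def reject_lowest_confidence(confidence: Sequence[float], budget: int) -> List[int]:
    """Baseline: send the `budget` lowest-confidence samples to the expert."""
    order = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return order[:budget]


def reject_with_flags(
    confidence: Sequence[float], flagged: Sequence[bool], budget: int
) -> List[int]:
    """Hypothetical variant: reject flagged samples first, then fall back
    to lowest confidence. `flagged` is an assumed stand-in for the output
    of pairwise queries; how the flags are computed is not specified here."""
    order = sorted(
        range(len(confidence)),
        key=lambda i: (not flagged[i], confidence[i]),
    )
    return order[:budget]


def selective_error(correct: Sequence[bool], rejected: Sequence[int]) -> float:
    """Fraction of wrong predictions among the non-rejected samples."""
    rejected_set = set(rejected)
    kept = [i for i in range(len(correct)) if i not in rejected_set]
    if not kept:
        return 0.0
    return sum(1 for i in kept if not correct[i]) / len(kept)


# Toy data (invented): samples 1 and 5 are confidently wrong, which is the
# kind of inconsistent confidence estimate the abstract describes.
confidence = [0.95, 0.90, 0.85, 0.60, 0.55, 0.99]
correct = [True, False, True, True, True, False]
flagged = [False, True, False, False, False, True]  # assumed pairwise signal
budget = 2  # number of samples the expert can label

baseline_error = selective_error(correct, reject_lowest_confidence(confidence, budget))
flagged_error = selective_error(correct, reject_with_flags(confidence, flagged, budget))
print(baseline_error, flagged_error)
```

On this toy data the baseline spends its budget on two low-confidence samples that were actually correct, leaving both confident errors among the non-rejected samples, while the flag-first variant removes them. Real flags would be noisy, so the gap in practice depends on how reliably the pairwise signal identifies high-error samples.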