🤖 AI Summary
To address the inefficiency of answer selection in multi-LLM systems, which commonly rely on external verifiers, human evaluation, or repeated sampling, this paper proposes a lightweight uncertainty-aware answer selection method. It leverages calibrated log-likelihood scores derived from each model's output to perform implicit uncertainty modeling, eliminating the need for auxiliary verifiers or redundant sampling. The approach uniformly supports both debate-based and non-debate reasoning paradigms. Empirical evaluation shows consistent improvements of approximately 4%, 3%, and 5% on GSM8K, MMLU, and ARC, respectively, outperforming self-consistency and state-of-the-art multi-model selection baselines. The core contribution is the first systematic use of intra-model calibrated likelihood scores as a reliability metric comparable across models, enabling efficient, robust answer selection under resource constraints.
📝 Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single-LLM self-consistency. We propose a principled, novel, and computationally efficient method to select the best response from multiple LLMs using a calibrated log-likelihood score, implicitly leveraging the models' inherent knowledge and confidence. Our method demonstrates improvements of approximately 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on the GSM8K, MMLU (6 subsets), and ARC datasets, respectively.
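To make the selection mechanism concrete, the sketch below illustrates one common way such a score can be computed: length-normalizing each model's summed token log-probabilities so that answers of different lengths (and from different models) become comparable, then picking the candidate with the highest score. The exact calibration used in the paper may differ; the model names and per-token log-probabilities here are invented for illustration.

```python
def calibrated_score(token_logprobs):
    """Length-normalized log-likelihood of a generated answer.

    Dividing by the token count is one simple calibration that keeps
    long and short answers (and different models) on a comparable scale.
    """
    return sum(token_logprobs) / len(token_logprobs)

def select_answer(candidates):
    """Return the candidate whose own model assigned it the highest
    calibrated log-likelihood (Best-of-N across multiple LLMs)."""
    return max(candidates, key=lambda c: calibrated_score(c["token_logprobs"]))

# Hypothetical per-token log-probabilities from three different LLMs.
candidates = [
    {"model": "llm_a", "answer": "42", "token_logprobs": [-0.9, -1.4, -0.3]},
    {"model": "llm_b", "answer": "41", "token_logprobs": [-2.1, -1.8, -2.5]},
    {"model": "llm_c", "answer": "42", "token_logprobs": [-0.5, -0.7]},
]

best = select_answer(candidates)
print(best["model"], best["answer"])  # llm_c scores -0.6, the highest average
```

Because the score is read directly from log-probabilities the models already emit during generation, no extra samples or verifier calls are needed, which is the efficiency argument made above.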