Cross-Model Disagreement as a Label-Free Correctness Signal

πŸ“… 2026-03-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of reliably detecting incorrect yet high-confidence responses from large language models in the absence of ground-truth labels. The authors propose cross-model disagreement as an unsupervised signal of answer correctness, introducing Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME): metrics derived from a verifier model's perplexity and entropy over the generated answer tokens. Requiring only a single forward pass, with no additional training and no generation from the verifier, the approach turns inter-model divergence into a training-free, label-free uncertainty estimator. Evaluated on benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K), the method substantially outperforms within-model uncertainty baselines; notably, CMP achieves a mean AUROC of 0.75 on MMLU, well above the within-model entropy baseline of 0.59.

πŸ“ Abstract
Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
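The abstract defines CMP as the verifier's perplexity on the generator's answer tokens and CME as the verifier's entropy at those positions, both read off a single forward pass. A minimal sketch of how the two scores could be computed from the verifier's next-token logits (the function name, array shapes, and interface are illustrative assumptions, not the paper's code):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_model_scores(verifier_logits, answer_token_ids):
    """Compute CMP and CME from a verifier's logits at the answer positions.

    verifier_logits: (T, V) array of the verifier's next-token logits at each
        of the T answer positions (one forward pass over prompt + answer).
    answer_token_ids: length-T sequence of the generator's answer token ids.
    """
    probs = softmax(verifier_logits)  # (T, V)
    # CMP: verifier's perplexity on the generated answer tokens
    # (exponentiated mean negative log-probability of those tokens).
    token_logprobs = np.log(
        probs[np.arange(len(answer_token_ids)), answer_token_ids]
    )
    cmp_score = float(np.exp(-token_logprobs.mean()))
    # CME: mean Shannon entropy of the verifier's next-token
    # distribution at the answer positions.
    cme_score = float(-(probs * np.log(probs)).sum(axis=-1).mean())
    return cmp_score, cme_score
```

Higher CMP or CME would indicate stronger disagreement between the verifier and the generated answer; in a deployment monitor or routing pipeline, either score could be thresholded to flag likely-incorrect responses.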
Problem

Research questions and friction points this paper is trying to address.

label-free correctness detection
confident errors
cross-model disagreement
language model reliability
uncertainty estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-model disagreement
label-free correctness
Cross-Model Perplexity
confident errors
training-free uncertainty
πŸ”Ž Similar Papers
No similar papers found.