Bridging Domain Expertise and Generalization for Performance Estimation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

career value

182K/year

🤖 AI Summary

This work addresses the challenging problem of model performance estimation under distribution shift when ground-truth labels are unavailable. The authors propose Fused Reference Alignment Prediction (FRAP), a novel approach that synergistically combines the generalization capability of external foundation models with the domain-specific expertise of the target task model. FRAP calibrates prediction distributions via temperature scaling, aligns them by minimizing KL divergence, and generates a robust, domain-adaptive pseudo-label reference distribution through confidence-weighted fusion. Extensive experiments demonstrate that FRAP consistently and substantially outperforms existing performance estimation methods across diverse datasets and model architectures, achieving stable and significant improvements.

📝 Abstract

Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.

Problem

Research questions and friction points this paper is trying to address.

performance estimation

distribution shift

domain expertise

generalization

unlabeled test set

Innovation

Methods, ideas, or system contributions that make the work stand out.

performance estimation

distribution shift

foundation model