AI Summary
This work addresses cross-domain deployment without target labels, a setting in which approaches that rely on a single vision-language model (VLM) implicitly assume that the chosen model is compatible with the target domain. The paper proposes a training-free framework that treats model selection, adaptation, and ensembling as one problem: estimating a latent, trustworthy sample-to-class structure in the target domain. Built on self-adaptive optimal transport, the method ranks models by combined semantic and visual reliability, fits transport-conditioned visual classifiers, and integrates predictions through reliability-aware probabilistic ensembling, all while keeping VLM parameters frozen. Experiments on natural-image, remote-sensing, and medical-pathology benchmarks show improved model ranking, adaptation stability, and ensemble robustness under heterogeneous VLM pools.
Abstract
Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that these three decisions depend on the same latent object: a trustworthy sample-to-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds (OSTB), a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The estimated transport structure is then reused for all three deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.
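To make the idea of reusing one consensus transport plan more concrete, the following is a minimal sketch, not the OSTB algorithm itself. The abstract does not specify the cost function, the marginals, or the self-adaptive mechanism, so everything here is an assumption: the consensus score is taken to be the average log-probability across models, the plan is computed with plain entropic (Sinkhorn) optimal transport under uniform marginals, reliability is the expected log-likelihood of each model under the plan's soft labels, and the names `sinkhorn_plan`, `consensus_plan`, `model_reliability`, and `reliability_weighted_ensemble` are hypothetical. The transport-conditioned classifier fitting step is not sketched.

```python
import numpy as np


def sinkhorn_plan(scores, epsilon=0.1, n_iters=200):
    """Entropic OT plan between N samples and K classes.

    Assumption: uniform marginals over samples and classes.
    scores: (N, K) array, higher means a better sample-class match.
    """
    n, k = scores.shape
    # Gibbs kernel; subtracting the row max only rescales rows, which the
    # row scaling vector u absorbs, so the resulting plan is unchanged.
    kernel = np.exp((scores - scores.max(axis=1, keepdims=True)) / epsilon)
    row_marginal = np.full(n, 1.0 / n)
    col_marginal = np.full(k, 1.0 / k)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)
        v = col_marginal / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


def consensus_plan(model_probs, epsilon=0.1):
    """Assumed consensus: average log-probabilities over the model pool."""
    log_probs = np.log(np.clip(np.stack(model_probs), 1e-12, 1.0))
    return sinkhorn_plan(log_probs.mean(axis=0), epsilon=epsilon)


def model_reliability(model_probs, plan):
    """Assumed reliability: mean log-likelihood under the plan's soft labels."""
    soft_labels = plan / plan.sum(axis=1, keepdims=True)
    return np.array([
        float((soft_labels * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1).mean())
        for p in model_probs
    ])


def reliability_weighted_ensemble(model_probs, reliability, temperature=1.0):
    """Softmax the reliabilities into weights and mix the model predictions."""
    weights = np.exp((reliability - reliability.max()) / temperature)
    weights /= weights.sum()
    ensemble = np.tensordot(weights, np.stack(model_probs), axes=1)
    return ensemble, weights
```

In this sketch, `model_probs` is a list of `(N, K)` zero-shot probability matrices, one per frozen VLM. Sorting the output of `model_reliability` gives a model ranking, and the same scores feed `reliability_weighted_ensemble`, which mirrors the abstract's point that a single estimated structure can serve both selection and ensembling.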