Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks

📅 2024-12-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Selecting optimal pre-trained vision-language models (VLMs) for downstream tasks in fully unsupervised settings—i.e., without labeled downstream data, large-scale auxiliary databases, or external language models—remains an open challenge. Method: We formally introduce and define the novel task of *unsupervised VLM selection*. To address it, we propose VEGA: a zero-shot evaluation framework that constructs multi-granularity graph structures from visual and textual features, then quantifies cross-modal alignment via node-level and edge-level consistency metrics—requiring no annotations or modality-specific supervision. Results: Extensive experiments across three diverse benchmarks demonstrate that VEGA consistently predicts downstream VLM performance with significantly higher accuracy than existing zero-shot selection methods. It offers an interpretable, computationally efficient, and generalizable evaluation paradigm for resource-constrained VLM deployment.

📝 Abstract
Vision-language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks. However, selecting the VLM with the highest performance on an unlabeled downstream task is non-trivial. Existing VLM selection methods focus on the class-name-only setting, relying on a supervised large-scale dataset and large language models, which may not be accessible or feasible during deployment. This paper introduces the problem of *unsupervised vision-language model selection*, where only unlabeled downstream datasets are available and no additional information is provided. To solve this problem, we propose a method termed Visual-tExtual Graph Alignment (VEGA), which selects VLMs without any annotations by measuring the alignment of the VLM between the two modalities on the downstream task. VEGA is motivated by the pretraining paradigm of VLMs, which aligns features with the same semantics from the visual and textual modalities, thereby mapping both modalities into a shared representation space. Specifically, we first construct two graphs on the visual and textual features, respectively. VEGA is then defined as the overall similarity between the visual and textual graphs at both the node and edge levels. Extensive experiments across three different benchmarks, covering a variety of application scenarios and downstream datasets, demonstrate that VEGA consistently provides reliable and accurate estimates of VLMs' performance on unlabeled downstream tasks.
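The abstract's node-and-edge graph comparison can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes one visual feature and one textual feature per class (e.g., visual class prototypes paired with class-name embeddings), uses dense cosine-similarity graphs, and the function name `vega_score` and the equal node/edge weighting are illustrative choices, not details from the paper.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def vega_score(img_feats: np.ndarray, txt_feats: np.ndarray) -> float:
    """Hypothetical sketch of a cross-modal graph-alignment score.

    Assumes img_feats[i] and txt_feats[i] describe the same class, so the
    two graphs share a node correspondence. Higher means better aligned.
    """
    # Node level: agreement between each paired visual/textual feature.
    img_n = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt_n = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    node_sim = float(np.mean(np.sum(img_n * txt_n, axis=1)))

    # Edge level: agreement between the two graphs' similarity structure
    # (dense similarity matrices stand in for graph adjacency here).
    adj_img = cosine_sim(img_feats, img_feats)
    adj_txt = cosine_sim(txt_feats, txt_feats)
    edge_sim = float(1.0 - np.mean(np.abs(adj_img - adj_txt)))

    # Equal weighting of the two levels is an arbitrary choice for this sketch.
    return 0.5 * (node_sim + edge_sim)
```

Under this sketch, perfectly aligned modalities (identical feature graphs) score 1.0, and the score drops as the visual and textual graphs diverge, which is the ranking signal the abstract describes.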
Problem

Research questions and friction points this paper is trying to address.

Unsupervised Learning
Visual Language Models
Model Selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

VEGA
Unsupervised Model Selection
Visual Language Model Evaluation
Yuhe Ding
School of Computer Science and Technology, Anhui University
Bo Jiang
School of Computer Science and Technology, Anhui University
Aihua Zheng
Anhui University
Qin Xu
School of Computer Science and Technology, Anhui University
Jian Liang
Kuaishou Inc.
transfer learning, graph learning