Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of visual in-context learning models are largely confined to settings that closely mirror their pre-training tasks or image domains, so they do not reflect how well the models adapt to novel tasks and distributions. To address this limitation, this work proposes VIBE, a unified benchmark for cross-domain, cross-task evaluation in a one-shot setting, covering 106 dataset-task combinations across 14 datasets and 12 tasks. Using a standardized, reproducible protocol, the study stress tests six models. The experiments reveal significant performance gaps when models confront unseen tasks and domains, uncover limitations and systematic failure modes, and point to promising directions for future research. The authors state that they will openly release the VIBE toolkit to foster broader evaluation.
📝 Abstract
Visual in-context learning has been proposed as a pathway towards dynamic models that generate predictions based on a provided context and can thereby adapt to new vision tasks at test time. Yet the evaluation of these models' adaptation capabilities has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this benchmark, we obtain a much clearer picture of the adaptive capabilities of visual in-context models when they are faced with new image and task distributions. We stress test six models on $14$ datasets and $12$ tasks (in total, we explore $106$ dataset-task combinations) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.
Problem

Research questions and friction points this paper is trying to address.

visual in-context learning
adaptation capability
evaluation benchmark
out-of-distribution generalization
cross-domain tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual in-context learning
benchmark
domain generalization
task adaptation
few-shot evaluation