AI Summary
This work addresses the overreliance of current Visual In-Context Learning (VICL) approaches on large-scale models and datasets, coupled with a lack of rigorous validation frameworks for adaptive capabilities. The authors propose a lightweight model with only 1 million parameters trained on 70,000 images to systematically evaluate VICL's task encoding mechanisms, pretraining task selection, and evaluation metrics across distribution shifts and diverse tasks. Remarkably, despite using up to 7,000 times fewer parameters and training samples than mainstream methods, this minimal model exposes critical blind spots in existing VICL benchmarks regarding task design and evaluation metrics. By demonstrating that such a compact architecture can challenge prevailing large-model paradigms, the study advocates for more scientifically grounded and efficient standards for assessing adaptive learning in visual contexts.
Abstract
Visual In-Context Learning (VICL) aims to make progress towards adaptive vision models that can, based on a few examples, adapt to a new task at test time. Following the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is to treat model and data scaling as key ingredients. Yet it is not clear whether these ingredients are what allows in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely $1$ million parameters on a modest $70,000$ images. We compare the results of this severely capacity-capped tiny model to $7,000\times$ larger VICL models in different adaptive settings: (1) on image data with small distribution shifts, (2) on unseen task encodings, and (3) on a completely new task, i.e., the setting VICL envisions. Given the chasm in training resources between the tiny and large models, our experiments expose shortcomings in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training, and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in the evaluation of adaptive capabilities.