🤖 AI Summary
Existing methods for multimodal in-context example retrieval select examples by feature similarity, which does not reliably pick the examples that most improve a large multimodal model's predictions. To address this limitation, this work proposes GRIP, a learnable, vision-only (text-free) retrieval framework. GRIP combines visual features with feedback signals from large multimodal models and uses contrastive learning to distinguish examples that help in-context predictions from those that hurt them. The retriever does not rely on textual information, and retrievers trained with feedback from one open model transfer to other models without retraining, including the closed-source GPT-4o and Gemini. Across classification, image captioning, and visual question answering, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B.
📝 Abstract
In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance.
To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
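The contrast between similarity-based retrieval and feedback-guided contrastive training can be illustrated with a minimal sketch. It is not GRIP's actual implementation: the abstract does not specify the loss, so an InfoNCE-style objective is assumed here, and the function names (`similarity_retrieve`, `feedback_contrastive_loss`), the temperature value, and the use of plain NumPy vectors as stand-ins for visual embeddings are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def similarity_retrieve(query, pool, k):
    """Baseline: indices of the k pool embeddings most cosine-similar to the query."""
    scores = l2_normalize(pool) @ l2_normalize(query)
    return np.argsort(-scores)[:k]


def feedback_contrastive_loss(query, beneficial, detrimental, temperature=0.1):
    """Assumed InfoNCE-style loss (not confirmed by the source).

    `beneficial` holds embeddings of examples that LMM feedback marked as
    helping the prediction; `detrimental` holds those that hurt it.
    The loss is -log(sum exp(pos) / sum exp(pos and neg)), so it is small
    when beneficial examples score higher than detrimental ones.
    """
    q = l2_normalize(query)
    pos = l2_normalize(beneficial) @ q / temperature
    neg = l2_normalize(detrimental) @ q / temperature
    logits = np.concatenate([pos, neg])
    # Log-sum-exp with max subtraction for numerical stability.
    m_all, m_pos = logits.max(), pos.max()
    log_denominator = m_all + np.log(np.exp(logits - m_all).sum())
    log_numerator = m_pos + np.log(np.exp(pos - m_pos).sum())
    return float(log_denominator - log_numerator)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=8)
    pool = rng.normal(size=(5, 8))
    print("top-2 by similarity:", similarity_retrieve(query, pool, k=2))
    print("loss:", feedback_contrastive_loss(query, pool[:2], pool[2:]))
```

In a trainable version, the embeddings would come from a learnable visual encoder and the loss would be minimized by gradient descent, so that retrieval ranks examples by expected benefit to the LMM instead of raw visual similarity.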