What do vision-language models see in the context? Investigating multimodal in-context learning

📅 2025-10-28

📈 Citations: 0

✨ Influential: 0

career value

164K/year

🤖 AI Summary

This work systematically investigates the effectiveness and limitations of vision-language models (VLMs) in multimodal in-context learning (ICL). To this end, the authors evaluate seven state-of-the-art VLMs across three image captioning tasks and, for the first time, quantitatively characterize their attention dynamics across varying numbers of in-context examples. Results reveal that while pretraining on interleaved image-text data improves ICL performance, VLMs remain heavily biased toward textual cues and fail to achieve substantive cross-modal fusion. Moreover, instruction fine-tuning enhances instruction-following capability but markedly reduces reliance on contextual examples—uncovering a fundamental trade-off between instruction alignment and context adaptation. Based on these findings, the paper proposes a novel evaluation paradigm for multimodal ICL and provides theoretical insights and practical optimization directions toward building truly collaborative, perception-integrated VLMs.

Technology Category

Application Category

📝 Abstract

In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on imag-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.

Problem

Research questions and friction points this paper is trying to address.

Investigating multimodal in-context learning effectiveness in Vision-Language Models

Analyzing how attention patterns change with increasing demonstration examples

Evaluating visual-textual integration limitations in current VLM architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated seven vision-language models on image captioning benchmarks

Analyzed attention patterns with increasing in-context demonstrations

Revealed training strategies' trade-offs for multimodal integration

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs