Parallel In-context Learning for Large Vision Language Models

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the computational bottleneck in multimodal in-context learning, where increasing the number of demonstration examples improves performance but incurs significant inference latency due to the quadratic complexity of Transformer attention mechanisms. To mitigate this, the authors propose Parallel In-Context Learning (Parallel-ICL), which partitions a long context into diverse short segments via clustering and processes them in parallel. A similarity-aware, weighted Product-of-Experts ensemble is then applied at the logit level to approximate the output of the full context. Experiments on visual question answering, image captioning, and classification tasks demonstrate that Parallel-ICL achieves accuracy comparable to full-context methods while substantially improving inference efficiency.

📝 Abstract
Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, doing so incurs significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
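The logit-level combination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each chunk has already produced next-token logits over a shared vocabulary, and that the similarity-based weights (one per chunk) are given. A weighted PoE is a geometric mixture, i.e., a weighted sum of per-chunk log-probabilities followed by renormalization; the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def weighted_poe(chunk_logits, weights):
    """Combine per-chunk next-token logits via a weighted Product-of-Experts.

    In log space, PoE is log p(y) ∝ sum_k w_k * log p_k(y), so we
    normalize the weights, average the per-chunk log-probabilities,
    and renormalize the result into a proper distribution.
    """
    total = sum(weights)
    norm_w = [w / total for w in weights]
    log_probs = [log_softmax(lg) for lg in chunk_logits]
    vocab_size = len(chunk_logits[0])
    combined = [
        sum(w * lp[v] for w, lp in zip(norm_w, log_probs))
        for v in range(vocab_size)
    ]
    return log_softmax(combined)  # renormalize into log-probabilities
```

Note that when all chunks agree, the combination reduces to the single-chunk distribution; when they disagree, chunks with higher similarity weights dominate the prediction.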
Problem

Research questions and friction points this paper is trying to address.

in-context learning
vision-language models
inference latency
computational efficiency
multi-modal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel In-Context Learning
Multi-modal In-Context Learning
Product-of-Experts
Context Chunking
Efficient Inference