QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models

📅 2025-11-05
🏛️ Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) face two key challenges in multi-image understanding: weak fine-grained perception across images and insufficient cross-image reasoning. These limitations are exacerbated by existing prompting methods, which largely target single images or narrow, domain-specific scenarios. The paper proposes Question-Guided Chain-of-Captions (QG-CoC), a general zero-shot prompting framework that enables collaborative understanding of an arbitrary number of images without model fine-tuning. Its core mechanism generates a chain of stepwise, question-relevant captions, so that fine-grained perception and cross-image reasoning are addressed jointly rather than in isolation. Evaluated on multiple open- and closed-source MLLMs over multi-image and single-image benchmarks, QG-CoC delivers consistent improvements over existing prompting strategies and is most robust in the challenging multi-image reasoning scenarios where prior methods fail.
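The summary above does not include implementation details, so the following is a minimal sketch, reconstructed from the abstract, of what a question-guided chain-of-captions prompting loop could look like. The `call_mllm` helper, the prompt wording, and the explicit sub-question decomposition step are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a QG-CoC-style prompting loop, reconstructed from the
# abstract. `call_mllm`, the prompt wording, and the decomposition step are
# hypothetical stand-ins, NOT the authors' code.
from typing import List


def call_mllm(images: List[object], prompt: str) -> str:
    """Placeholder for any multimodal LLM call (open- or closed-source)."""
    raise NotImplementedError("wire up your MLLM API here")


def qg_coc_answer(images: List[object], question: str) -> str:
    # Step 1 (assumed): turn the question into focused sub-questions so that
    # captioning is guided by what the query actually needs.
    sub_questions = call_mllm(
        images,
        "List the simple visual sub-questions needed to answer:\n"
        f"{question}\nOne sub-question per line.",
    ).splitlines()

    # Step 2: build a chain of question-relevant captions, one per
    # sub-question, instead of a single generic description per image.
    chain: List[str] = []
    for i, sub_q in enumerate(sub_questions, start=1):
        caption = call_mllm(
            images,
            f"Sub-question {i}: {sub_q}\n"
            "Describe only the visual evidence across the images that "
            "answers this sub-question.",
        )
        chain.append(f"[{i}] {caption}")

    # Step 3: reason over the accumulated caption chain to answer the
    # original question (zero-shot, no fine-tuning involved).
    return call_mllm(
        images,
        "Question-guided captions:\n" + "\n".join(chain)
        + f"\n\nUsing these captions, answer: {question}",
    )
```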

📝 Abstract
Multimodal Large Language Models (MLLMs) encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. While various prompting methods aim to describe visual content, most existing studies focus on single-image settings or specific, constrained scenarios, leaving a critical gap in understanding how MLLMs tackle more general and complex multi-image reasoning tasks. We therefore first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to the needed clues and in seamlessly integrating perception and reasoning. Inspired by these findings, we propose Question-Guided Chain-of-Captions (QG-CoC), a new zero-shot, generalized prompting method that effectively handles problems with an arbitrary number of images. We evaluate our method on various open-source and closed-source MLLMs over multi-image and single-image benchmarks. Experimental results indicate that QG-CoC achieves competitive performance across tasks and exhibits robust improvements in the challenging scenarios where existing prompting methods fail.
Problem

Research questions and friction points this paper is trying to address.

Addresses fine-grained perception limitations in multi-image multimodal models
Enhances reasoning capability across multiple visual inputs in MLLMs
Provides generalized prompting for complex multi-image tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Question-guided chain-of-captions (QG-CoC) for multimodal models
Zero-shot prompting that handles an arbitrary number of images
Integrates fine-grained visual perception with cross-image reasoning (a hypothetical usage sketch follows this list)
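As a concrete but hypothetical illustration of the zero-shot usage, the `qg_coc_answer` sketch above could be driven as follows; the file names, question, and PIL-based image loading are assumptions for the example only.

```python
# Illustrative only: drives the hypothetical qg_coc_answer sketch above.
from PIL import Image

images = [Image.open("scene_a.jpg"), Image.open("scene_b.jpg")]
question = "Which object appears in both images but changes position?"

# Zero-shot: nothing is fine-tuned; the method is pure prompting.
print(qg_coc_answer(images, question))
```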
Kuei-Chun Kao
UCLA CS
LLM · Multimodal Language Agents · HAI · Human Computer Interaction · NLP

Hsu Tzu-Yin
Department of Computer Science, University of California, Los Angeles

Yunqi Hong
University of California, Los Angeles
LLM post-training · Multimodal LLM

Ruochen Wang
Department of Computer Science, University of California, Los Angeles

Cho-Jui Hsieh
University of California, Los Angeles
Machine Learning · Optimization