🤖 AI Summary
Current large multimodal models (LMMs) are limited in complex visual reasoning tasks by their reliance on single-step inference. To address this, the paper proposes VoCoT, an object-centric, visually grounded chain-of-thought framework that enables multi-step reasoning. Its core contributions are: (1) object-centric reasoning paths built around cross-modal shared object-level information, with visually grounded object representations interleaved and aligned across modalities; and (2) VolCano, a 7B-parameter instantiation of VoCoT built on prevalent open-source LMM architectures and adapted via a purpose-built instruction-tuning dataset. Despite its modest size and limited input image resolution, VolCano outperforms state-of-the-art models, including GPT-4V, on reasoning-intensive benchmarks such as CLEVR and EmbSpatial. The code, data, and model weights are open-sourced.
📝 Abstract
While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representations of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-form generation. To adapt LMMs to reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with prevalent open-source LMM architectures, we develop a VoCoT-based model, VolCano. With only 7B parameters and limited input image resolution, VolCano demonstrates excellent performance across various scenarios. On benchmarks such as CLEVR and EmbSpatial, which demand complex reasoning capabilities, VolCano outperforms SOTA models, including the powerful GPT-4V. Related code, data, and models are released at https://github.com/RupertLuo/VoCoT.
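To make the idea of an interleaved, visually grounded reasoning path more concrete, here is a minimal sketch of what an object-centric chain-of-thought string might look like. The `<obj>`/`<box>` tag format, the helper names, and the normalized-coordinate convention are all illustrative assumptions for this sketch, not the paper's actual token scheme.

```python
# Hypothetical sketch of a VoCoT-style reasoning path: each object
# mention is interleaved with its bounding box so the textual chain
# stays anchored to visual evidence. Tag names and coordinate format
# are assumptions, not the paper's exact representation.

def ground(obj_name, box):
    """Render an object mention alongside its normalized
    [x1, y1, x2, y2] bounding box."""
    coords = ",".join(f"{c:.2f}" for c in box)
    return f"<obj>{obj_name}</obj><box>[{coords}]</box>"

def build_reasoning_path(steps):
    """Join per-step strings into one multi-step chain-of-thought."""
    return " ".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))

# Example: a spatial-reasoning question in the style of CLEVR/EmbSpatial.
steps = [
    f"locate {ground('red cube', [0.12, 0.40, 0.28, 0.62])}.",
    f"locate {ground('blue sphere', [0.55, 0.35, 0.70, 0.55])}.",
    "the red cube's box lies left of the blue sphere's box, "
    "so the answer is 'left'.",
]
path = build_reasoning_path(steps)
print(path)
```

The point of the interleaving is that each intermediate step references a grounded object rather than a free-floating noun phrase, which is what lets a multi-step chain stay aligned with the image.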