🤖 AI Summary
Current large multimodal models (LMMs) are limited in complex visual reasoning tasks by their reliance on single-step inference. To address this, the paper proposes VoCoT, an object-centric, visually grounded chain-of-thought framework that enables multi-step reasoning. Its core contributions are: (1) object-centric reasoning paths built around cross-modal shared object-level information, with visually grounded object representations interleaved and aligned across modalities; and (2) VolCano, a 7B-parameter instantiation of VoCoT built on prevalent open-source LMM architectures and adapted via a purpose-built instruction-tuning dataset. Despite its modest size and limited input image resolution, VolCano outperforms state-of-the-art models, including GPT-4V, on reasoning-intensive benchmarks such as CLEVR and EmbSpatial. The code, data, and model weights are open-sourced.
📝 Abstract
While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness in handling complex tasks has been limited by the prevailing single-step reasoning paradigm. To this end, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representations of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-form generation. To adapt LMMs to reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with prevalent open-source LMM architectures, we develop a VoCoT-based model, VolCano. With only 7B parameters and limited input image resolution, VolCano demonstrates excellent performance across various scenarios. On benchmarks such as CLEVR and EmbSpatial, which demand complex reasoning capabilities, VolCano outperforms SOTA models, including the powerful GPT-4V. Related code, data, and models are released at https://github.com/RupertLuo/VoCoT.
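To make the idea of an interleaved, visually grounded reasoning path more concrete, here is a minimal sketch of what an object-centric chain-of-thought string might look like. The `<obj>`/`<box>` tag format, the helper names, and the normalized-coordinate convention are all illustrative assumptions for this sketch, not the paper's actual token scheme.

```python
# Hypothetical sketch of a VoCoT-style reasoning path: each object
# mention is interleaved with its bounding box so the textual chain
# stays anchored to visual evidence. Tag names and coordinate format
# are assumptions, not the paper's exact representation.

def ground(obj_name, box):
    """Render an object mention alongside its normalized
    [x1, y1, x2, y2] bounding box."""
    coords = ",".join(f"{c:.2f}" for c in box)
    return f"<obj>{obj_name}</obj><box>[{coords}]</box>"

def build_reasoning_path(steps):
    """Join per-step strings into one multi-step chain-of-thought."""
    return " ".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))

# Example: a spatial-reasoning question in the style of CLEVR/EmbSpatial.
steps = [
    f"locate {ground('red cube', [0.12, 0.40, 0.28, 0.62])}.",
    f"locate {ground('blue sphere', [0.55, 0.35, 0.70, 0.55])}.",
    "the red cube's box lies left of the blue sphere's box, "
    "so the answer is 'left'.",
]
path = build_reasoning_path(steps)
print(path)
```

The point of the interleaving is that each intermediate step references a grounded object rather than a free-floating noun phrase, which is what lets a multi-step chain stay aligned with the image.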