Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models (LMMs) exhibit strong visual perception capabilities but lack high-level, task-specific compositional reasoning, hindering progress toward general visual intelligence. Method: We propose a human-inspired understanding-thinking-answering paradigm that enables end-to-end visual reasoning within a single forward pass, without iterative inference, multi-step API calls, or external tools. Our approach extends large vision-language model architectures with integrated visual grounding, multi-granularity representation learning, and instruction tuning, trained on a newly curated 334K-sample high-quality visual instruction dataset. Contribution/Results: The resulting model, Griffon-R, achieves state-of-the-art performance on complex visual reasoning benchmarks (e.g., VSR, CLEVR) and substantially improves results across mainstream multimodal evaluations (e.g., MMBench, ScienceQA). It further demonstrates enhanced interpretability and response faithfulness, advancing both capability and reliability in visual language understanding.

📝 Abstract
Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. However, they often fall short in integrating advanced, task-specific capabilities for compositional reasoning, which hinders their progress toward truly competent general vision models. To address this, we present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems by leveraging their intrinsic capabilities (e.g., grounding and visual understanding). Unlike previous shortcut-learning mechanisms, our approach introduces a human-like understanding-thinking-answering process, allowing the model to complete all steps in a single forward pass without multiple inferences or external tools. This design bridges the gap between foundational visual capabilities and general question answering, encouraging LMMs to generate faithful and traceable responses for complex visual reasoning. Meanwhile, we curate 334K visual instruction samples covering both general and text-rich scenes and involving multiple foundational visual capabilities. Our trained model, Griffon-R, performs end-to-end automatic understanding, self-thinking, and reasoned answering. Comprehensive experiments show that Griffon-R not only achieves leading performance on complex visual reasoning benchmarks, including VSR and CLEVR, but also enhances multimodal capabilities across various benchmarks such as MMBench and ScienceQA. Data, models, and code will be released at https://github.com/jefferyZhan/Griffon/tree/master/Griffon-R soon.
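To make the single-pass paradigm concrete, the sketch below runs understanding, thinking, and answering with one generate call. It is a minimal sketch assuming a Hugging Face-style vision-language interface; the checkpoint name, prompt template, and staged output format are illustrative assumptions, not the released Griffon-R API.

```python
# Minimal sketch of single-pass understand-think-answer inference.
# Assumes a Hugging Face-style vision-language interface; the model id,
# prompt wording, and stage labels below are hypothetical, not the
# official Griffon-R release.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "JefferyZhan/Griffon-R"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")
# One prompt elicits all three stages; no tools or repeated calls.
prompt = (
    "Question: Is the cat to the left of the sofa?\n"
    "First describe the relevant regions (understand), then reason "
    "step by step (think), then give the final answer (answer)."
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
# A single generate() call produces understanding, thinking, and the
# final answer as one autoregressive sequence.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The design point this illustrates is that the reasoning trace and the answer are emitted together in one forward pass, with no external tools or repeated inference rounds.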
Problem

Research questions and friction points this paper is trying to address.

Enhancing compositional reasoning in Large Multimodal Models
Bridging visual capabilities and general question answering
Developing end-to-end visual understanding and reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified visual reasoning mechanism for LMMs
Human-like understanding-thinking-answering process
End-to-end automatic understanding and reasoning
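To make the understanding-thinking-answering supervision concrete, a training sample from the curated instruction data might pair a question with a staged target response, as sketched below; the field names and contents are hypothetical, since the page does not specify the released data schema.

```python
# Hypothetical shape of one visual-instruction training sample; the
# field names are illustrative, not the released 334K-dataset schema.
sample = {
    "image": "coco/000000123456.jpg",
    "question": "How many red cubes are left of the metal sphere?",
    # The target response traces all three stages in one sequence, so
    # the model learns to answer within a single forward pass.
    "response": (
        "Understanding: The image contains three cubes (two red, one "
        "blue) and one metal sphere near the right edge.\n"
        "Thinking: Both red cubes lie to the left of the sphere; the "
        "blue cube does not count.\n"
        "Answer: 2"
    ),
}
```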
Yufei Zhan
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Large Multimodal Models · Grounding and Detection
Hongyin Zhao
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yousong Zhu
Associate Professor, Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models · Self-supervised Learning · Object Detection
Shurong Zheng
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Fan Yang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China
Ming Tang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Peng Cheng Laboratory, Shenzhen, China; Wuhan AI Research, Wuhan, China