🤖 AI Summary
Existing evaluations of Chain-of-Thought (CoT) reasoning in Large Multimodal Models (LMMs) lack systematic, multimodal benchmarking. Method: We introduce MME-CoT—the first comprehensive multimodal CoT benchmark—spanning six domains: math, science, OCR, logic, space-time, and general scenes. It features a fine-grained, three-dimensional evaluation framework assessing CoT quality, robustness, and efficiency, supported by structured reasoning-path annotations. Contributions/Results: Models with reflection mechanisms achieve superior CoT quality, with Kimi k1.5 outperforming GPT-4o; however, CoT prompting degrades average performance on perception-heavy tasks by 12.7%, suggesting harmful overthinking, and reflection-based LMMs require 2.3× more time for self-correction than for initial responses, revealing severe inefficiency. This work establishes a foundational benchmark for interpretable, optimized multimodal CoT analysis.
📝 Abstract
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest-quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting potentially harmful overthinking behavior; and 3) Despite high CoT quality, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/