MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

📅 2024-08-23
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks suffer from limited scale, reliance on model-generated annotations, low image resolution, and insufficient task difficulty, and thus fail to reflect real-world challenges. To address these limitations, this work introduces MME-RealWorld, the largest fully human-annotated, high-resolution multimodal benchmark to date, comprising 13,366 high-quality images and 29,429 expert-curated question-answer pairs across five categories of real-world scenarios that are challenging even for humans. MME-RealWorld places particular emphasis on high-resolution visual perception and complex real-world reasoning. Its annotation pipeline involves 25 annotators and 7 MLLM domain experts and covers 43 fine-grained subtasks with rigorous quality control. A comprehensive evaluation of 28 state-of-the-art MLLMs reveals that even the strongest model achieves less than 60% accuracy, underscoring persistent bottlenecks in high-resolution visual understanding and realistic multimodal reasoning.

📝 Abstract
Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks present several common barriers that make it difficult to measure the significant challenges models face in the real world, including: 1) small data scale leads to large performance variance; 2) reliance on model-based annotations restricts data quality; 3) insufficient task difficulty, especially caused by limited image resolution. To tackle these issues, we introduce MME-RealWorld. Specifically, we collect more than 300K images from public datasets and the Internet, filtering 13,366 high-quality images for annotation. This involves the efforts of 25 professional annotators and 7 experts in MLLMs, contributing 29,429 question-answer pairs that cover 43 subtasks across 5 real-world scenarios and are extremely challenging even for humans. As far as we know, MME-RealWorld is the largest manually annotated benchmark to date, featuring the highest resolution and a targeted focus on real-world applications. We further conduct a thorough evaluation involving 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Our results show that even the most advanced models struggle with our benchmark: none of them reaches 60% accuracy. Perceiving high-resolution images and understanding complex real-world scenarios remain urgent challenges to be addressed. The data and evaluation code are released at https://mme-realworld.github.io/ .
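Since the benchmark's questions are multiple-choice, the reported accuracy reduces to exact-match scoring of each model's chosen option against the gold answer, aggregated overall and per subtask. The sketch below shows what such a scoring loop could look like; the JSON layout and field names (question_id, subtask, answer) are assumptions for illustration, not necessarily the released format.

```python
import json
from collections import defaultdict


def score_predictions(annotation_path: str, predictions: dict[str, str]) -> dict[str, float]:
    """Compute overall and per-subtask accuracy for multiple-choice QA pairs.

    `annotation_path` is assumed to point to a JSON list of records with
    hypothetical fields "question_id", "subtask", and "answer" (gold option
    letter); `predictions` maps question IDs to the model's chosen letter.
    """
    with open(annotation_path, encoding="utf-8") as f:
        records = json.load(f)

    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for rec in records:
        subtask = rec["subtask"]
        total[subtask] += 1
        # Exact match on the option letter, ignoring case and whitespace.
        pred = predictions.get(rec["question_id"], "").strip().upper()
        if pred == rec["answer"].strip().upper():
            correct[subtask] += 1

    accuracy = {t: correct[t] / total[t] for t in total}
    accuracy["overall"] = sum(correct.values()) / sum(total.values())
    return accuracy
```

Reporting per-subtask accuracy alongside the overall number matters here: with 43 subtasks of uneven size, a single aggregate score can hide which fine-grained capabilities a model actually lacks.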
Problem

Research questions and friction points this paper is trying to address.

How reliably can multimodal LLMs perceive and reason over high-resolution, real-world images?
Existing benchmarks suffer from small data scale, model-generated annotations, and insufficient task difficulty.
Even the most advanced models struggle with complex real-world tasks that are difficult for humans.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale collection of 13,366 high-quality, high-resolution images filtered from more than 300K candidates
Fully manual annotation by 25 professional annotators and 7 MLLM experts
Comprehensive evaluation of 28 prominent MLLMs, including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet