MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks (e.g., BrowseComp) focus solely on textual understanding in web browsing, overlooking the pervasively multimodal nature of real-world web content. Method: We introduce MM-BrowseComp, the first benchmark for multimodal web browsing, comprising 224 manually crafted, challenging tasks that span joint image/video/text retrieval and cross-modal reasoning. It features a multimodal dependency evaluation framework and a fine-grained checklist for verifying reasoning paths. To enable precise assessment of joint vision-language reasoning, human-designed multimodal prompts are coupled with authentic webpage content. Contribution/Results: The evaluation systematically exposes a critical bottleneck: current models lack native multimodal collaborative reasoning capabilities. Even state-of-the-art tool-augmented models (e.g., OpenAI o3) achieve only 29.02% accuracy, underscoring severe limitations in multimodal agent performance.

📝 Abstract
AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.
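The abstract describes pairing a final-answer accuracy score with a per-question verified checklist for fine-grained analysis of reasoning paths. A minimal sketch of how such checklist-based scoring might be aggregated is below; the data structures and field names are hypothetical illustrations, not the paper's actual evaluation code.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Hypothetical record for one benchmark question."""
    correct: bool              # did the agent's final answer match?
    checklist_hits: list[bool] # which verified checklist steps were satisfied

def evaluate(results: list[TaskResult]) -> tuple[float, float]:
    """Aggregate final-answer accuracy and checklist step coverage."""
    accuracy = sum(r.correct for r in results) / len(results)
    total_steps = sum(len(r.checklist_hits) for r in results)
    steps_hit = sum(sum(r.checklist_hits) for r in results)
    coverage = steps_hit / total_steps if total_steps else 0.0
    return accuracy, coverage

# Toy example: one fully solved task, one task where the agent
# completed only the first reasoning step and answered incorrectly.
results = [
    TaskResult(correct=True,  checklist_hits=[True, True, True]),
    TaskResult(correct=False, checklist_hits=[True, False, False]),
]
acc, cov = evaluate(results)
print(f"accuracy={acc:.2%}, checklist coverage={cov:.2%}")
```

Separating coverage from accuracy is what makes the analysis fine-grained: an agent can fail the final answer while still demonstrating partial progress along the verified reasoning path.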
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal web browsing agents' retrieval and reasoning capabilities
Addressing limitations of text-only benchmarks for web content interaction
Assessing image and video information processing in AI browsing agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark for browsing agents
Hand-crafted questions with image prompts
Verified checklist for reasoning analysis