Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses conflicting information across heterogeneous modalities in multimodal deep information seeking, a problem that existing agents handle poorly because they largely follow evidence accumulation models that aggregate evidence linearly. The authors propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that maintains an evolving multimodal structural graph throughout reasoning, making the agent explicitly conflict-aware. The workflow is plug-and-play and model-agnostic: across five backbone models it yields an average relative accuracy improvement of 17.2% on BrowseComp-VL. It also outperforms state-of-the-art vision-language models and deep research agents, with relative accuracy gains over the second-best approach of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL.
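The abstract reports these gains as relative accuracy improvements, which are easy to confuse with absolute percentage-point gains. The snippet below is a minimal sketch of the standard relative-improvement formula. The function name and the example accuracies are hypothetical illustrations, not values from the paper, and the paper's exact averaging procedure across backbones is not specified here.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Relative gain in percent: 100 * (new - baseline) / baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies (NOT from the paper): a backbone moving
# from 40.0% to 46.0% accuracy.
absolute_gain = 46.0 - 40.0                         # percentage points
relative_gain = relative_improvement(40.0, 46.0)    # percent of baseline
print(absolute_gain, relative_gain)
```

In this made-up case the absolute gain is 6 percentage points while the relative gain is 15%, so the two readings of a reported percentage can differ substantially.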

📝 Abstract

Deep research agents have attracted increasing attention for their ability to collect large-scale online information to acquire target knowledge, with recent efforts shifting from purely text-based information seeking to multimodal settings. However, existing agentic workflows are largely aligned with evidence accumulation models, which linearly aggregate evidence and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones; and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.
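To make the contrast between linear evidence accumulation and conflict-aware revision concrete, here is a toy sketch. It is not the paper's implementation: the `Evidence` fields, the `reliability` score, the "last value wins" baseline, and the "keep the more reliable claim" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    slot: str           # what the claim is about (hypothetical schema)
    value: str          # the claimed value
    modality: str       # e.g. "text" or "image"
    reliability: float  # assumed score in [0, 1]; not defined by the paper


def accumulate_linearly(stream):
    """Naive stand-in for linear accumulation: the latest value per slot wins."""
    notes = {}
    for ev in stream:
        notes[ev.slot] = ev.value
    return notes


@dataclass
class StructureGraph:
    beliefs: dict = field(default_factory=dict)    # slot -> Evidence currently held
    conflicts: list = field(default_factory=list)  # (kept, rejected) pairs

    def revise(self, new: Evidence) -> None:
        current = self.beliefs.get(new.slot)
        if current is None:
            self.beliefs[new.slot] = new
            return
        if current.value == new.value:
            # Agreement: keep whichever copy is more reliable.
            if new.reliability > current.reliability:
                self.beliefs[new.slot] = new
            return
        # Contradiction: keep the more reliable claim and record the conflict.
        if new.reliability > current.reliability:
            kept, rejected = new, current
        else:
            kept, rejected = current, new
        self.beliefs[new.slot] = kept
        self.conflicts.append((kept, rejected))


# Hypothetical evidence stream: a reliable text source, then a noisy image reading.
stream = [
    Evidence("founding_year", "1998", "text", 0.9),
    Evidence("founding_year", "1989", "image", 0.4),
]
graph = StructureGraph()
for ev in stream:
    graph.revise(ev)

print(accumulate_linearly(stream))
print(graph.beliefs["founding_year"].value, len(graph.conflicts))
```

In this toy run the naive baseline ends up with the later, less reliable image reading, while the revision-style update keeps the text claim and logs the contradiction so it can be inspected later.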

Problem

Research questions and friction points this paper is trying to address.

multimodal information seeking

contradictory information

heterogeneous modalities

evidence accumulation

deep research agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Struct-Searcher

belief revision

multimodal reasoning