MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of conventional video retrieval: the inefficiency of full-database embedding scans and the semantic asymmetry between sparse text queries and information-dense video content. The authors propose MAVIS, a multi-agent collaborative reasoning framework that parses videos into a structured semantic knowledge base for attribute-level indexing. A planner decomposes each user query into sub-tasks and dispatches specialized agents to nominate candidates. A logic-aware debate mechanism with a strict veto protocol then prunes logically inconsistent candidates, leaving a compact set of contested samples for fine-grained verification. The approach reframes video retrieval as a multi-agent inference process that requires no task-specific fine-tuning. Experiments on MSR-VTT, MSVD, and ActivityNet show competitive performance, and the authors position the framework as a scalable, interpretable alternative to dual-encoder retrieval.

📝 Abstract

The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
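The workflow described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the attribute schema (`object`, `action`, `scene`), the function names `build_index` and `retrieve`, and the veto rule are all assumptions made for illustration. Here the planner's output is given directly as a list of `(attribute, value)` sub-tasks, an agent vetoes a candidate when the library records that attribute with a conflicting value, and a candidate with a missing attribute is treated as "controversial" and left for fine-grained verification.

```python
from collections import defaultdict

def build_index(library):
    """Inverted index over the structured library: (attribute, value) -> video ids."""
    index = defaultdict(set)
    for vid, attrs in library.items():
        for attr, values in attrs.items():
            for value in values:
                index[(attr, value)].add(vid)
    return index

def retrieve(library, index, sub_tasks):
    """Return (accepted, controversial) video ids for the given sub-tasks.

    Assumed veto rule (hypothetical): a candidate is pruned if any sub-task's
    attribute is recorded for the video but does not contain the requested value.
    """
    # Each attribute agent nominates candidates from the index only,
    # so videos matching no sub-task are never inspected.
    nominations = {task: set(index.get(task, ())) for task in sub_tasks}
    candidates = set().union(*nominations.values())

    accepted, controversial = set(), set()
    for vid in candidates:
        attrs = library[vid]
        vetoed = any(a in attrs and v not in attrs[a] for a, v in sub_tasks)
        if vetoed:
            continue  # explicit contradiction: pruned by the debate
        if all(vid in nominations[task] for task in sub_tasks):
            accepted.add(vid)       # every agent supports this candidate
        else:
            controversial.add(vid)  # some agent abstained: needs verification
    return accepted, controversial

# Hypothetical structured library: video id -> attribute -> set of values.
LIBRARY = {
    "v1": {"object": {"dog", "ball"}, "action": {"running"}, "scene": {"park"}},
    "v2": {"object": {"dog"}, "action": {"sleeping"}, "scene": {"living room"}},
    "v3": {"object": {"cat", "ball"}, "action": {"running"}, "scene": {"park"}},
    "v4": {"object": {"car"}, "action": {"driving"}, "scene": {"street"}},
    "v5": {"object": {"dog"}, "action": {"running"}},  # scene was not extracted
}
INDEX = build_index(LIBRARY)

# Sub-tasks a planner might emit for "a dog running in a park".
SUB_TASKS = [("object", "dog"), ("action", "running"), ("scene", "park")]
accepted, controversial = retrieve(LIBRARY, INDEX, SUB_TASKS)
```

In this toy run, `v4` is never nominated, `v2` and `v3` are vetoed on a conflicting attribute, `v1` is accepted by all agents, and only `v5` remains for a more expensive verification step. This mirrors the abstract's claim that the debate narrows attention to a compact candidate set, under the assumptions stated above.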

Problem

Research questions and friction points this paper is trying to address.

Video Retrieval

Semantic Asymmetry

Computational Inefficiency

Granularity Mismatch

Structured Video Understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Retrieval

Structured Video Understanding

Logic-aware Debate