🤖 AI Summary
To address the excessive communication overhead, high token consumption, and poor scalability of multimodal multi-agent retrieval-augmented generation (RAG) systems, this paper proposes a hierarchical communication graph pruning framework. It pioneers the application of hierarchical graph pruning to multi-agent coordination, adaptively identifying and preserving critical communication pathways via intra-modal sparsification and cross-modal dynamic topology construction. Integrated with multimodal large language models and external knowledge retrieval, the framework employs a progressive pruning strategy that significantly reduces redundant agent interactions while preserving collaborative performance. Experimental results show that the method consistently outperforms both single-agent baselines and state-of-the-art multi-agent RAG systems on general and domain-specific benchmarks, achieving an average 32.7% reduction in token consumption and a 2.1× inference speedup. This work establishes a novel paradigm for efficient, scalable multimodal multi-agent RAG.
📝 Abstract
Recent advances in multi-modal retrieval-augmented generation (mRAG), which enhances multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite their impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to the textual and visual modalities, identifying the edges most critical to solving the task. Subsequently, we use these key edges to construct a dynamic communication topology for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient, hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and strong multi-agent mRAG systems while significantly reducing token consumption.
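The core idea of the abstract's pipeline (score communication edges, then progressively drop the weakest until a sparser topology remains) can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's method: the agent names, edge scores, and `keep_ratio`/`rounds` parameters below are hypothetical placeholders standing in for the intra- and inter-modal sparsification stages that M$^3$Prune actually uses to score edges.

```python
# Toy sketch of progressive communication-graph pruning.
# Edge "importance" scores are placeholders; in the paper they would
# come from the intra-modal and inter-modal sparsification stages.

def progressive_prune(edges, scores, keep_ratio=0.6, rounds=3):
    """Iteratively drop the lowest-scoring edges over several rounds
    until only keep_ratio of the original edges remain."""
    kept = list(edges)
    target = max(1, int(len(edges) * keep_ratio))
    while len(kept) > target:
        # Remove one batch of the weakest edges per round.
        batch = max(1, (len(kept) - target) // rounds)
        kept.sort(key=lambda e: scores[e], reverse=True)
        kept = kept[:len(kept) - batch]
    return kept

# Hypothetical graph: agents a..d, suffixes _t (text) / _v (vision).
edges = [("a_t", "b_t"), ("a_t", "c_v"), ("b_t", "d_v"),
         ("c_v", "d_v"), ("a_t", "d_v")]
scores = {("a_t", "b_t"): 0.9, ("a_t", "c_v"): 0.2,
          ("b_t", "d_v"): 0.7, ("c_v", "d_v"): 0.8,
          ("a_t", "d_v"): 0.1}

pruned = progressive_prune(edges, scores)
print(pruned)  # keeps the 3 highest-scoring edges
```

Pruning in small batches rather than all at once mirrors the "progressive" strategy the abstract describes, which leaves room to re-evaluate edge importance between rounds in a real system.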