Dynamic Orchestration of Multi-Agent System for Real-World Multi-Image Agricultural VQA

📅 2025-09-29
🤖 AI Summary
Problem: Existing agricultural visual question answering (VQA) methods rely on single-image inputs and static pipelines, which limits their ability to reason jointly across multi-scale, multi-temporal imagery and to infer robustly when external knowledge is scarce. Method: This paper proposes a self-reflective multi-agent framework tailored to real-world agricultural scenarios. It establishes a dynamic four-role architecture (retrieval, reflection, answer generation, and refinement) that supports cross-image spatial-temporal alignment, retrieval of external agricultural knowledge, and context-aware fusion. Parallel answer drafting and iterative refinement target the usual bottlenecks of limited evidence and rigid pipelines. Contribution/Results: Evaluated on the AgMMU benchmark, the framework achieves competitive performance on multi-image agricultural QA. It offers a systematic, quality-controlled alternative to monolithic, inflexible architectures for complex agricultural VQA.

📝 Abstract
Agricultural visual question answering is essential for providing farmers and researchers with accurate and timely knowledge. However, many existing approaches are developed for evidence-constrained settings such as text-only queries or single-image cases. This design prevents them from coping with real-world agricultural scenarios, which often require multi-image inputs with complementary views across spatial scales and growth stages. Moreover, limited access to up-to-date external agricultural context makes these systems struggle to adapt when evidence is incomplete. In addition, rigid pipelines often lack systematic quality control. To address these gaps, we propose a self-reflective and self-improving multi-agent framework that integrates four roles: the Retriever, the Reflector, the Answerer, and the Improver. These roles collaborate to enable context enrichment, reflective reasoning, answer drafting, and iterative improvement. The Retriever formulates queries and gathers external information, while the Reflector assesses the adequacy of that information and, when it falls short, triggers query reformulation and renewed retrieval. Two Answerers draft candidate responses in parallel to reduce bias. The Improver refines the drafts through iterative checks while ensuring that information from multiple images is effectively aligned and utilized. Experiments on the AgMMU benchmark show that our framework achieves competitive performance on multi-image agricultural QA.
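The Retriever and Reflector interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `RetrievalState`, `retrieve_with_reflection`, `retrieve`, `is_adequate`, `reformulate`, and the `max_rounds` cap are all hypothetical, and the three callables stand in for whatever model or search backend a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RetrievalState:
    """Queries issued so far and the evidence they returned (hypothetical container)."""
    question: str
    queries: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


def retrieve_with_reflection(
    question: str,
    retrieve: Callable[[str], List[str]],            # Retriever: query -> evidence snippets
    is_adequate: Callable[[str, List[str]], bool],   # Reflector: is the evidence sufficient?
    reformulate: Callable[[str, List[str]], str],    # Reflector: propose a new query
    max_rounds: int = 3,                             # assumed cap, not taken from the paper
) -> RetrievalState:
    """Retrieve, reflect on adequacy, and reformulate until adequate or out of rounds."""
    state = RetrievalState(question=question)
    query = question
    for _ in range(max_rounds):
        state.queries.append(query)
        state.evidence.extend(retrieve(query))
        if is_adequate(question, state.evidence):
            break
        # Evidence judged insufficient: reformulate and retrieve again.
        query = reformulate(question, state.evidence)
    return state
```

Capping the number of rounds is a design choice of this sketch; it keeps the loop from running indefinitely when the Reflector is never satisfied.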
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-image agricultural VQA with complementary spatial and temporal views
Overcoming limitations of single-image approaches and incomplete external evidence
Enhancing systematic quality control through self-reflective multi-agent collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with four collaborative roles
Self-reflective reasoning with iterative retrieval reformulation
Parallel answer drafting and multi-image alignment refinement
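As a rough illustration of parallel answer drafting followed by iterative refinement, the sketch below runs several Answerer callables concurrently and then lets an Improver loop check and revise a merged answer. Everything here is an assumption made for illustration: the function names, the `merge`, `check`, and `revise` callables, and the round limit are hypothetical and are not described in the paper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Hypothetical Answerer signature: (question, image paths, evidence) -> draft answer
Answerer = Callable[[str, Sequence[str], Sequence[str]], str]


def draft_in_parallel(
    answerers: Sequence[Answerer],
    question: str,
    images: Sequence[str],
    evidence: Sequence[str],
) -> List[str]:
    """Run every Answerer concurrently; drafts come back in Answerer order."""
    with ThreadPoolExecutor(max_workers=len(answerers)) as pool:
        futures = [pool.submit(a, question, images, evidence) for a in answerers]
        return [f.result() for f in futures]


def improve(
    drafts: Sequence[str],
    merge: Callable[[Sequence[str]], str],       # combine the drafts into one answer
    check: Callable[[str], List[str]],           # return a list of issues (empty = pass)
    revise: Callable[[str, List[str]], str],     # rewrite the answer to address issues
    max_rounds: int = 3,                         # assumed cap, not taken from the paper
) -> str:
    """Improver loop: check the answer and revise until no issues remain."""
    answer = merge(drafts)
    for _ in range(max_rounds):
        issues = check(answer)
        if not issues:
            break
        answer = revise(answer, issues)
    return answer
```

In a real system, `check` would be the place to verify that the answer draws on every supplied image rather than only one; here it is left as an opaque callable.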
Authors
Yan Ke
The University of Queensland, Brisbane QLD 4072, Australia
Xin Yu
The University of Queensland, Brisbane QLD 4072, Australia
Heming Du
The University of Queensland
computer vision
Scott Chapman
The University of Queensland, Brisbane QLD 4072, Australia
Helen Huang
The University of Queensland, Brisbane QLD 4072, Australia