AgriDoctor: A Multimodal Intelligent Assistant for Agriculture

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current crop disease diagnosis methods rely predominantly on unimodal models that lack integrated agricultural domain knowledge and natural-language interaction, while general-purpose multimodal large models underperform in agriculture due to data scarcity and insufficient domain adaptation. To address these limitations, the authors propose AgriDoctor, presented as the first agent-based multimodal reasoning framework tailored for agriculture. Its modular architecture enables synergistic decision-making across image recognition, language understanding, and agricultural knowledge retrieval, supporting intent-driven tool invocation and bilingual (Chinese–English) interaction. Evaluated on AgriMM, the authors' newly constructed large-scale, fine-grained agricultural multimodal benchmark, the framework achieves state-of-the-art performance on both disease diagnosis and intelligent question answering, improving diagnostic accuracy by 12.6% over prior approaches and demonstrating the effectiveness and scalability of agricultural knowledge-guided multimodal collaborative reasoning.

📝 Abstract
Accurate crop disease diagnosis is essential for sustainable agriculture and global food security. Existing methods, which primarily rely on unimodal models such as image-based classifiers and object detectors, are limited in their ability to incorporate domain-specific agricultural knowledge and lack support for interactive, language-based understanding. Recent advances in large language models (LLMs) and large vision-language models (LVLMs) have opened new avenues for multimodal reasoning. However, their performance in agricultural contexts remains limited due to the absence of specialized datasets and insufficient domain adaptation. In this work, we propose AgriDoctor, a modular and extensible multimodal framework designed for intelligent crop disease diagnosis and agricultural knowledge interaction. As a pioneering effort to introduce agent-based multimodal reasoning into the agricultural domain, AgriDoctor offers a novel paradigm for building interactive and domain-adaptive crop health solutions. It integrates five core components: a router, a classifier, a detector, a knowledge retriever, and LLMs. To facilitate effective training and evaluation, we construct AgriMM, a comprehensive benchmark comprising 400,000 annotated disease images, 831 expert-curated knowledge entries, and 300,000 bilingual prompts for intent-driven tool selection. Extensive experiments demonstrate that AgriDoctor, trained on AgriMM, significantly outperforms state-of-the-art LVLMs on fine-grained agricultural tasks, establishing a new paradigm for intelligent and sustainable farming applications.
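The abstract's router-plus-tools design can be pictured as a small dispatch loop: the router infers the user's intent and forwards the query to the classifier, detector, knowledge retriever, or LLM. The sketch below is a minimal illustration of that pattern only; all function names, the keyword heuristic, and the stubbed tool outputs are assumptions for illustration, not the paper's actual implementation (which trains the router on the 300,000 intent prompts in AgriMM).

```python
# Illustrative sketch of an intent-driven tool router in the style AgriDoctor
# describes. The tools here are stubs; a real system would back them with a
# trained image classifier, a lesion detector, a knowledge base, and an LLM.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Query:
    text: str
    has_image: bool = False


# --- Stubbed tools (hypothetical stand-ins for the five components) ---
def classify_disease(q: Query) -> str:
    return "classifier: predicted disease label for the attached image"


def detect_lesions(q: Query) -> str:
    return "detector: bounding boxes around lesion regions"


def retrieve_knowledge(q: Query) -> str:
    return "retriever: matched expert-curated knowledge entry"


def answer_with_llm(q: Query) -> str:
    return "llm: free-form answer grounded in tool outputs"


TOOLS: Dict[str, Callable[[Query], str]] = {
    "classify": classify_disease,
    "detect": detect_lesions,
    "retrieve": retrieve_knowledge,
    "chat": answer_with_llm,
}


def route(q: Query) -> str:
    """Pick a tool name from a crude keyword heuristic (illustrative only;
    the paper trains this step on bilingual intent prompts)."""
    text = q.text.lower()
    if q.has_image and ("locate" in text or "where" in text):
        return "detect"
    if q.has_image:
        return "classify"
    if "treat" in text or "prevention" in text:
        return "retrieve"
    return "chat"


def handle(q: Query) -> str:
    """Dispatch the query to the routed tool and return its answer."""
    return TOOLS[route(q)](q)
```

The design point this illustrates is that each component stays independently replaceable behind a uniform interface, which is what makes the framework modular and extensible.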
Problem

Research questions and friction points this paper is trying to address.

Existing crop disease diagnosis methods lack domain knowledge integration
Current multimodal models perform poorly in agricultural contexts
There is a shortage of specialized datasets for agricultural AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular multimodal framework integrating five core components: router, classifier, detector, knowledge retriever, and LLMs
Agent-based reasoning with domain-adaptive agricultural knowledge
Comprehensive benchmark with annotated images and bilingual prompts
Mingqing Zhang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zhuoning Xu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Peijie Wang
Institute of Automation, Chinese Academy of Sciences
Rongji Li
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Liang Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Qiang Liu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Jian Xu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xuyao Zhang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Shu Wu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Liang Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences