Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks

📅 2025-04-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Fine-grained classification, poor generalization under limited samples, and heavy reliance on high-resolution imagery hinder robust plant disease identification. Method: This paper proposes a synergistic multimodal framework integrating the large language model GPT-4o with the convolutional neural network ResNet-50, employing progressive fine-tuning and zero-/few-shot learning strategies. Contribution/Results: It is the first systematic validation of GPT-4o’s tunability for fine-grained plant disease recognition, overcoming the dependency of purely vision-based models on large-scale, high-resolution data. On an apple leaf disease dataset, the framework achieves 98.12% accuracy—significantly surpassing ResNet-50 (96.88%)—with near-zero training loss. Moreover, it demonstrates strong cross-resolution and cross-species generalization (e.g., to maize), establishing a new paradigm for agricultural intelligent monitoring that is highly robust and resource-efficient.

Technology Category

Application Category

📝 Abstract
Automation in agriculture plays a vital role in addressing challenges related to crop monitoring and disease management, particularly through early detection systems. This study investigates the effectiveness of combining multimodal Large Language Models (LLMs), specifically GPT-4o, with Convolutional Neural Networks (CNNs) for automated plant disease classification using leaf imagery. Leveraging the PlantVillage dataset, we systematically evaluate model performance across zero-shot, few-shot, and progressive fine-tuning scenarios. A comparative analysis between GPT-4o and the widely used ResNet-50 model was conducted across three resolutions (100, 150, and 256 pixels) and two plant species (apple and corn). Results indicate that fine-tuned GPT-4o models achieved slightly better performance compared to the performance of ResNet-50, achieving up to 98.12% classification accuracy on apple leaf images, compared to 96.88% achieved by ResNet-50, with improved generalization and near-zero training loss. However, zero-shot performance of GPT-4o was significantly lower, underscoring the need for minimal training. Additional evaluations on cross-resolution and cross-plant generalization revealed the models' adaptability and limitations when applied to new domains. The findings highlight the promise of integrating multimodal LLMs into automated disease detection pipelines, enhancing the scalability and intelligence of precision agriculture systems while reducing the dependence on large, labeled datasets and high-resolution sensor infrastructure. Large Language Models, Vision Language Models, LLMs and CNNs, Disease Detection with Vision Language Models, VLMs
Problem

Research questions and friction points this paper is trying to address.

Automating plant disease detection using multimodal LLMs and CNNs
Evaluating performance of GPT-4o versus ResNet-50 for disease classification
Enhancing precision agriculture with scalable, low-data AI solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines multimodal LLMs with CNNs
Uses GPT-4o for disease classification
Fine-tuning enhances model accuracy
🔎 Similar Papers
No similar papers found.