🤖 AI Summary
This study addresses the limited generalization of existing crop disease classification models across diverse acquisition conditions—such as laboratory versus field settings—and the absence of a unified benchmark for systematic comparison. To this end, the authors introduce AgriPath-LF16, a new high-quality benchmark dataset that explicitly distinguishes between laboratory and field images. Using this benchmark, they conduct the first comprehensive cross-domain evaluation of convolutional neural networks (CNNs), contrastive vision-language models (VLMs), and generative VLMs. Their findings reveal that CNNs achieve the highest performance on in-domain laboratory data but exhibit poor out-of-domain generalization; contrastive VLMs demonstrate parameter efficiency and robust cross-domain performance; and generative VLMs show the greatest robustness to distribution shifts, albeit with failure modes tied to text generation. The work underscores the importance of selecting model architectures based on deployment context rather than solely on accuracy metrics.
📝 Abstract
Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.
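The two evaluation metrics named above can be sketched in a few lines. Below is a minimal, illustrative implementation: macro-F1 is the standard unweighted mean of per-class F1 scores, while the Parse Success Rate is written here under the *assumed* definition suggested by its name and the abstract's framing (the fraction of a generative model's free-text outputs that map to a valid class label); the paper's exact parsing rules may differ.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro-F1)."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Convention: a class with no predictions and no instances scores 0.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def parse_success_rate(raw_outputs, labels):
    """Assumed PSR: fraction of free-text generations that resolve to a
    valid class label (here: exact match after stripping/case-folding)."""
    valid = {label.lower() for label in labels}
    hits = sum(1 for out in raw_outputs if out.strip().lower() in valid)
    return hits / len(raw_outputs)
```

This separation matters for generative VLMs: an output like "the leaf shows early blight symptoms" may be semantically right yet fail to parse into a label, so accuracy alone would conflate classification errors with text-formatting failures.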