🤖 AI Summary
Existing VQA benchmarks inadequately evaluate vision-language models' (VLMs) logical reasoning and problem-solving capabilities in complex agricultural scenarios. To address this, we introduce AgriCoT, the first Chain-of-Thought (CoT)-enhanced visual question answering dataset for agriculture, comprising 4,535 samples. By incorporating human-annotated CoT rationales into agricultural VLM evaluation, AgriCoT enables fine-grained analysis of both multimodal understanding and stepwise reasoning. Zero-shot evaluation across 26 state-of-the-art VLMs reveals that while proprietary models achieve higher answer accuracy, they exhibit substantial deficiencies in reasoning coherence and causal logic. This work bridges a critical gap in explainable reasoning assessment for agriculture and advances VLM evaluation from an "answer correctness" paradigm toward a "reasoning validity" paradigm: what matters is not only what is answered, but how and why.
📝 Abstract
Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, their dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of VLM reasoning abilities, particularly in zero-shot scenarios, by focusing on their capacity for logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs spanning both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, a significant gap remains in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.