🤖 AI Summary
Existing agricultural multimodal benchmarks rely predominantly on closed-set classification and explicit queries, failing to capture the open-ended challenges inherent in expert consultation, such as ambiguous user intents, implicit knowledge gaps, and the frequent appearance of rare biological entities. To address this, we propose MIRAGE, the first high-fidelity multimodal benchmark tailored for agricultural expert dialogue, constructed from 35,000 real human-expert interactions comprising text queries, expert responses, and image contexts. Its core innovations include: (1) an open-world evaluation setting that requires models to detect knowledge blind spots, link over 7,000 distinct biological entities, generate clarification strategies, and produce extended textual responses; (2) rigorous multi-stage human annotation and validation; and (3) support for joint vision-language reasoning and domain-knowledge integration. MIRAGE significantly improves the validity of evaluating complex agricultural reasoning, cross-modal understanding, and professional dialogue capabilities, thereby advancing the trustworthy deployment of knowledge-intensive AI in real-world agricultural applications.
📝 Abstract
We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity testbed for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse real-world benchmarks available for vision-language models. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios in an open-world setting, requiring models to infer latent knowledge gaps, handle rare entities, and either respond directly or proactively guide the interaction. Project Page: https://mirage-benchmark.github.io