🤖 AI Summary
Existing evaluation benchmarks for vision-oriented Retrieval-Augmented Generation (RAG) lack systematic modeling of query difficulty and ambiguity, hindering fine-grained diagnosis of model failure modes on complex queries.
Method: We introduce MRAG-Suite, the first difficulty- and ambiguity-aware diagnostic evaluation platform for Visual RAG, featuring a multi-granularity framework for quantifying query complexity and a claim-level hallucination detection tool, MM-RAGChecker. The platform unifies diverse benchmarks (WebQA, Chart-RAG, Visual-RAG, and MRAG-Bench) and employs controllable filtering to isolate high-difficulty and high-ambiguity samples.
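To make the controllable filtering concrete, here is a minimal Python sketch of difficulty- and ambiguity-controlled subset selection. The record fields and thresholds (`difficulty`, `ambiguity`, `d_min`, `a_min`) are illustrative assumptions; the summary does not specify the platform's actual scoring interface.

```python
from dataclasses import dataclass

@dataclass
class QuerySample:
    # Hypothetical record; field names are assumptions, not MRAG-Suite's API.
    query: str
    benchmark: str      # e.g. "WebQA", "Chart-RAG", "Visual-RAG", "MRAG-Bench"
    difficulty: float   # query-complexity score in [0, 1]
    ambiguity: float    # ambiguity score in [0, 1]

def filter_hard(samples: list[QuerySample],
                d_min: float = 0.7, a_min: float = 0.7) -> list[QuerySample]:
    """Keep queries that score high on difficulty or ambiguity."""
    return [s for s in samples if s.difficulty >= d_min or s.ambiguity >= a_min]

pool = [
    QuerySample("Which chart reports higher 2021 revenue?", "Chart-RAG", 0.85, 0.30),
    QuerySample("What color is the bird?", "Visual-RAG", 0.20, 0.10),
    QuerySample("Who is pictured at the ceremony?", "WebQA", 0.55, 0.80),
]
hard_subset = filter_hard(pool)  # keeps the first and third samples
```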
Contribution/Results: Experiments reveal that state-of-the-art models suffer substantial accuracy degradation (average drop of 28.6%) on challenging queries. MM-RAGChecker enables precise, fine-grained attribution of hallucinated claims to their root causes (e.g., retrieval errors, multimodal misalignment, or reasoning flaws), establishing an interpretable diagnostic paradigm for robustness analysis and targeted improvement of Visual RAG systems.
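The attribution step can be pictured as a small decision procedure over verified claims. The sketch below assumes hypothetical judge signals (`retrieved_ok`, `grounded_in_image`, `entailed_by_evidence`) that a real checker would obtain from an LLM/VLM verifier; it illustrates how a hallucinated claim could be routed to one of the three root causes named above, and is not MM-RAGChecker's actual implementation.

```python
from enum import Enum

class Cause(Enum):
    RETRIEVAL_ERROR = "required evidence was never retrieved"
    MULTIMODAL_MISALIGNMENT = "claim conflicts with the visual evidence"
    REASONING_FLAW = "evidence present, but the conclusion does not follow"

def attribute(claim: str, retrieved_ok: bool, grounded_in_image: bool,
              entailed_by_evidence: bool) -> Cause | None:
    """Route a hallucinated claim to a root cause; None means the claim is supported."""
    if not retrieved_ok:
        return Cause.RETRIEVAL_ERROR
    if not grounded_in_image:
        return Cause.MULTIMODAL_MISALIGNMENT
    if not entailed_by_evidence:
        return Cause.REASONING_FLAW
    return None

# Example: evidence retrieved and image-grounded, but the conclusion
# does not follow from it -> attributed to a reasoning flaw.
cause = attribute("Revenue doubled in 2021", retrieved_ok=True,
                  grounded_in_image=True, entailed_by_evidence=False)
assert cause is Cause.REASONING_FLAW
```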
📝 Abstract
Multimodal Retrieval-Augmented Generation (Visual RAG) advances question answering by integrating visual and textual evidence, yet current evaluations fail to account systematically for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform that integrates diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies alongside MM-RAGChecker, a claim-level diagnostic tool. Our results show substantial accuracy reductions on difficult and ambiguous queries and reveal prevalent hallucinations. MM-RAGChecker effectively diagnoses these issues, guiding future improvements to Visual RAG systems.