Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

📅 2026-06-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing evaluations of multimodal large language models (MLLMs) focus primarily on answer accuracy, which says little about whether a model grounds its response in the correct visual evidence in multi-view autonomous driving scenes. To address this gap, this work introduces a multi-view visual question answering benchmark that separates identification of the visual evidence source from answer correctness. Given six synchronized nuScenes camera images and a question, a model must identify the camera view that supports the answer and then answer the question. The benchmark covers conflict-centric tasks (causal reasoning, counterfactual reasoning, and intent prediction) and comprises 122 question-answer pairs from 73 scenes, with view labels proposed by an automatic conflict-mining pipeline and manually verified by annotators. Three settings are evaluated: camera-view selection, oracle QA given the golden view, and joint prediction of view and answer in one pass. Structured predictions are scored by exact match and free-form responses by an LLM judge. The benchmark exposes visual grounding failures that answer-only evaluation misses, adding an evaluation dimension for trustworthy multimodal reasoning.

📝 Abstract

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view autonomous driving scenes, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized nuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
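
To make the separation between view identification and answer correctness concrete, here is a minimal scoring sketch for the joint-prediction setting with multiple-choice answers. It is an illustration under stated assumptions, not the paper's evaluation code: the `Prediction` record, the `score` function, and the metric names are hypothetical, and exact match is applied after simple whitespace and case normalization. The camera names are the six nuScenes camera channels.

```python
from dataclasses import dataclass

# The six nuScenes camera channels (one image per channel in each sample).
CAMERA_VIEWS = (
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
)


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record for one question in the joint setting."""
    pred_view: str    # camera view the model selected
    gold_view: str    # annotated supporting view
    pred_choice: str  # multiple-choice option the model selected
    gold_choice: str  # correct option


def normalize(text: str) -> str:
    """Strip whitespace and upper-case so exact match ignores formatting."""
    return text.strip().upper()


def score(preds: list[Prediction]) -> dict[str, float]:
    """Score view selection and answers separately, then jointly."""
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions to score")
    view_ok = [normalize(p.pred_view) == normalize(p.gold_view) for p in preds]
    ans_ok = [normalize(p.pred_choice) == normalize(p.gold_choice) for p in preds]
    pairs = list(zip(view_ok, ans_ok))
    return {
        "view_acc": sum(view_ok) / n,
        "answer_acc": sum(ans_ok) / n,
        # Both the supporting view and the answer are correct.
        "joint_acc": sum(v and a for v, a in pairs) / n,
        # Correct answer grounded in the wrong view: the failure mode
        # that answer-only evaluation cannot see.
        "right_answer_wrong_view": sum(a and not v for v, a in pairs) / n,
    }


# Toy data, invented purely for illustration.
example = [
    Prediction("CAM_FRONT", "CAM_FRONT", "A", "A"),
    Prediction("CAM_BACK", "CAM_FRONT_LEFT", "b", "B"),
    Prediction("CAM_BACK_RIGHT", "CAM_BACK_RIGHT", "C", "D"),
    Prediction("CAM_FRONT", "CAM_BACK_LEFT", "A", "A"),
]
metrics = score(example)
print(metrics)
```

On this toy data, answer accuracy is 0.75 while joint accuracy is only 0.25, because two of the three correct answers are paired with the wrong camera view. Free-form answers would need a judge in place of the exact-match comparison, which this sketch does not cover.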

Problem

Research questions and friction points this paper is trying to address.

visual evidence identification

multi-view MLLMs

autonomous driving

grounding failure

camera view selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view MLLMs

visual evidence identification

autonomous driving