AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual (AV) benchmarks evaluate only final-answer accuracy, so “correct” answers produced by flawed reasoning or hallucination go undetected, obscuring models’ true comprehension. Method: We introduce AURA, a fine-grained AV reasoning benchmark comprising six cross-modal cognitive tasks (e.g., causality, timbre and pitch, tempo and AV synchronization) that require explicit, evidence-based logical inference over synchronized audio and video while suppressing uni-modal shortcuts. We propose AuraScore, a metric that quantifies reasoning quality along two orthogonal dimensions: *factual consistency* (alignment with multimodal evidence) and *core inference validity* (logical soundness of the critical reasoning step), assessed against human-annotated data with a dual-track scoring protocol combining rule-based heuristics and model-assisted evaluation. Results: Evaluations of state-of-the-art AV foundation models reveal a stark dissociation: answer accuracy reaches up to 92% on some tasks, yet both AuraScore dimensions fall below 45%, exposing severe deficits in evidence-grounded reasoning.

📝 Abstract
Current audio-visual (AV) benchmarks focus on final answer accuracy, overlooking the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers derived through flawed reasoning or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains, such as causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling, explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal shortcuts. To assess reasoning traces, we propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity. It decomposes reasoning into two aspects: (i) Factual Consistency - whether reasoning is grounded in perceptual evidence, and (ii) Core Inference - the logical validity of each reasoning step. Evaluations of SOTA models on AURA reveal a critical reasoning gap: although models achieve high accuracy (up to 92% on some tasks), their Factual Consistency and Core Inference scores fall below 45%. This discrepancy highlights that models often arrive at correct answers through flawed logic, underscoring the need for our benchmark and paving the way for more robust multimodal evaluation.
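
The abstract describes AuraScore as decomposing a reasoning trace into a grounding check (Factual Consistency) and a logical-validity check (Core Inference), scored against human annotations through a dual-track rule-based plus model-assisted protocol. Below is a minimal Python sketch of how such a decomposed scorer could be wired together; the paper does not publish this interface, so `ReasoningStep`, `rule_check`, and `judge_model` are hypothetical names, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only: all names below (ReasoningStep, rule_check,
# judge_model) are assumptions, not the paper's released API.

@dataclass
class ReasoningStep:
    claim: str                # one statement from the model's reasoning trace
    cited_evidence: set[str]  # audio/video evidence IDs the step appeals to
    is_core_inference: bool   # True for the step carrying the final deduction

def factual_consistency(trace: list[ReasoningStep],
                        gold_evidence: set[str]) -> float:
    """Fraction of steps whose cited evidence is contained in the
    human-annotated evidence set (the grounding dimension)."""
    if not trace:
        return 0.0
    grounded = sum(step.cited_evidence <= gold_evidence for step in trace)
    return grounded / len(trace)

def core_inference(trace: list[ReasoningStep],
                   rule_check: Callable[[ReasoningStep], bool],
                   judge_model: Callable[[ReasoningStep], bool]) -> float:
    """Dual-track validity for the critical step(s): a rule-based
    heuristic and a model-assisted judge must both accept the step."""
    core = [s for s in trace if s.is_core_inference]
    if not core:
        return 0.0
    valid = sum(rule_check(s) and judge_model(s) for s in core)
    return valid / len(core)
```

Reporting these two scores separately from answer accuracy is what exposes the gap the paper highlights: a model can answer correctly while both `factual_consistency` and `core_inference` remain low.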
Problem

Research questions and friction points this paper is trying to address.

Evaluating audio-visual reasoning beyond final answer accuracy
Assessing reasoning fidelity through factual consistency and logic
Preventing uni-modal shortcuts in cross-modal understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

AURA benchmark for evaluating cross-modal AV reasoning
AuraScore metric decomposing reasoning into factual consistency and core inference validity
Task design that forces joint grounding in both audio and video, blocking uni-modal shortcuts