AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Omni-MLLMs lack a systematic benchmark for audio-visual intelligence (AVI), which makes it difficult to assess their cross-modal perception, understanding, and reasoning comprehensively. To address this gap, this work proposes AVI-Bench, a cognitively inspired benchmark that evaluates models in three stages (perception, understanding, and reasoning) using cross-modal tasks that require joint audio-visual interpretation, enabling fine-grained diagnosis of capabilities and failure modes. It further introduces AVI-Bench-PriSe, an extension built from unfamiliar, low-semantic stimuli that probes models' primitive audio-visual sensation and their generalization beyond common training distributions. Experiments on open-source and closed-source Omni-MLLMs reveal substantial limitations in their audio-visual intelligence. Based on these findings, the authors present a four-level AVI taxonomy.

📝 Abstract

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages (perception, understanding, and reasoning) through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

Problem

Research questions and friction points this paper is trying to address.

audio-visual intelligence

Omni-MLLMs

benchmark

cross-modal evaluation

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Intelligence

Omni-MLLMs

AVI-Bench

cross-modal reasoning

generalization evaluation

🔎 Similar Papers

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

2024-06-14 · arXiv.org · Citations: 15