AI Summary
This study addresses a critical gap in the evaluation of large audio language models, which has predominantly focused on task-specific or modality-level performance while overlooking the core mechanisms of human auditory cognition. To bridge this gap, the work introduces the Cattell-Horn-Carroll (CHC) cognitive theory into the field for the first time and proposes RAIL, a cognition-centered evaluation paradigm. RAIL formally defines five key auditory cognitive abilities and establishes a suite of cognition-aligned structured tasks, a principled dataset, and a human-aligned evaluation protocol. Experiments across 26 state-of-the-art models reveal significant imbalances in model capabilities across perceptual, memory, and reasoning dimensions, thereby demonstrating RAIL's effectiveness and necessity in systematically uncovering cognitive weaknesses in current audio language models.
Abstract
Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.