GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

📅 2026-06-06
🤖 AI Summary
Current evaluations of large audio-language models (LALMs) lack authentic linguistic and cultural context as well as acoustic realism, which limits how well they reflect real-world performance. To address this gap, this work proposes GlobeAudio, the first multilingual, naturalistic benchmark constructed by native speakers from real-world audio recordings. It spans six typologically diverse languages and comprises 5,637 multiple-choice questions that require high-level auditory reasoning and culturally grounded understanding. The authors use GlobeAudio to systematically evaluate open- and closed-source LALMs as well as cascaded ASR-LLM pipelines. The experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages, exposing critical limitations in current LALMs and underscoring the need to evaluate audio understanding in realistic scenarios.
📝 Abstract
Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation of LALMs remains heavily underspecified relative to real-world requirements: most existing benchmarks lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, crafted by native speakers and grounded in naturally occurring audio. To do well, models must combine higher-level auditory reasoning with culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .
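As an illustration of how results on a multilingual multiple-choice benchmark of this kind are commonly scored, the sketch below computes per-language accuracy and an unweighted macro-average across languages. The record fields (`language`, `answer`, `prediction`), the placeholder language codes, and the toy data are assumptions made for this example; they are not taken from the GlobeAudio release, and the paper's own scoring protocol may differ.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for multiple-choice records.

    Each record is a dict with hypothetical keys:
      'language'   - language label of the question
      'answer'     - gold option label, e.g. "A"
      'prediction' - option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        lang = record["language"]
        total[lang] += 1
        if record["prediction"] == record["answer"]:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def macro_average(per_language):
    """Unweighted mean over languages, so small languages count equally."""
    return sum(per_language.values()) / len(per_language)


# Toy data with placeholder language codes (not real benchmark items).
toy_records = [
    {"language": "xx", "answer": "A", "prediction": "A"},
    {"language": "xx", "answer": "C", "prediction": "B"},
    {"language": "yy", "answer": "D", "prediction": "D"},
    {"language": "yy", "answer": "B", "prediction": "B"},
]

per_language = accuracy_by_language(toy_records)
overall = macro_average(per_language)
print(per_language, overall)
```

Reporting accuracy per language rather than a single pooled number keeps gaps on low-resource languages visible, which is the kind of gap the abstract highlights.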
Problem

Research questions and friction points this paper is trying to address.

Large Audio-Language Models
Naturalistic Evaluation
Multilingual
Multicultural
Acoustic Realism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Audio-Language Models
Naturalistic Evaluation
Multilingual Benchmark
Cultural Grounding
Auditory Reasoning