GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

📅 2026-06-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of large audio-language models (LALMs) lack authentic linguistic and cultural context and acoustic realism, so they reflect real-world performance poorly. To address this gap, this work proposes GlobeAudio, the first multilingual, naturalistic benchmark constructed by native speakers from real-world audio recordings. It spans six typologically diverse languages and comprises 5,637 multiple-choice questions that require high-level auditory reasoning and culturally grounded understanding. Using GlobeAudio, the authors systematically evaluate both open- and closed-source LALMs as well as cascaded ASR-LLM pipelines. Experimental results reveal substantial performance degradation under natural acoustic conditions, particularly for open-source models and low-resource languages, exposing critical limitations in current LALMs.

📝 Abstract

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation of LALMs remains heavily underspecified relative to real-world requirements: most existing benchmarks lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers and grounded in naturally occurring audio. To do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio.
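Because the benchmark is multiple-choice and the reported gaps are broken down by language, scoring reduces to per-language accuracy. The snippet below is a minimal sketch of that computation, not the authors' evaluation code. The record keys (`language`, `answer`, `prediction`), the function name, and the toy data are all hypothetical; the actual field names in the released dataset may differ.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
    "language", "answer" (gold option letter), "prediction" (model's letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy records, made up for illustration only (placeholder language labels).
records = [
    {"language": "lang_a", "answer": "B", "prediction": "b"},
    {"language": "lang_a", "answer": "A", "prediction": "C"},
    {"language": "lang_b", "answer": "D", "prediction": "D"},
]
print(accuracy_by_language(records))
```

Reporting accuracy per language rather than as one pooled number keeps languages with fewer questions from being masked by larger ones.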

Problem

Research questions and friction points this paper is trying to address.

Large Audio-Language Models

naturalistic evaluation

multilingual

multicultural

acoustic realism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Audio-Language Models

Naturalistic Evaluation

Multilingual Benchmark

Cultural Grounding

Auditory Reasoning

🔎 Similar Papers

AudioBench: A Universal Benchmark for Audio Large Language Models

2024-06-23 · arXiv.org · Citations: 17