Audio-Language Datasets of Scenes and Events: A Survey

📅 2024-07-09
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This survey reviews 69 audio-language datasets, covering research up to September 2024, and finds recurring problems: imbalanced sound categories, overlap between datasets that draw on the same sources, a predominance of English-language content, accessibility barriers, and dataset biases. Methodologically, it combines principal component analysis of audio and text embeddings to compare acoustic and linguistic variability across datasets, CLAP-embedding-based detection of data leakage, and an analysis of sound category distributions, with particular attention to widely used sources such as YouTube and Freesound. An accompanying repository is available at https://github.com/GLJS/audio-datasets. The survey closes by identifying the main challenges and opportunities in building large, diverse datasets that improve audio-language model (ALM) performance.

📝 Abstract
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). It provides a comprehensive analysis of dataset origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets like AudioSet, with over two million samples, and community platforms like Freesound, with over one million samples. Through principal component analysis of audio and text embeddings, the survey evaluates the acoustic and linguistic variability across datasets. It also analyzes data leakage through CLAP embeddings and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting opportunities for improvement.
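
The leakage analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each clip has already been mapped to an embedding vector (in practice by a CLAP model; here the vectors are hand-written toy values), and it flags cross-dataset pairs whose cosine similarity reaches a threshold. The function names, the toy identifiers, and the 0.99 threshold are illustrative assumptions, not values taken from the survey.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def find_possible_leaks(train, test, threshold=0.99):
    """Return (train_id, test_id, similarity) for pairs at or above threshold.

    `train` and `test` map clip identifiers to embedding vectors.
    The threshold is a hypothetical choice and should be tuned per model.
    """
    hits = []
    for train_id, train_vec in train.items():
        for test_id, test_vec in test.items():
            sim = cosine_similarity(train_vec, test_vec)
            if sim >= threshold:
                hits.append((train_id, test_id, sim))
    # Most similar pairs first, so likely duplicates are reviewed early.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Toy embeddings standing in for CLAP outputs (hypothetical values).
train_embeddings = {
    "a_dog_bark": [0.9, 0.1, 0.0],
    "a_rain": [0.0, 0.2, 0.9],
}
test_embeddings = {
    "b_dog_bark_copy": [0.9, 0.1, 0.0],
    "b_siren": [0.1, 0.9, 0.1],
}

leaks = find_possible_leaks(train_embeddings, test_embeddings)
for train_id, test_id, sim in leaks:
    print(f"{train_id} ~ {test_id}: {sim:.3f}")
```

The all-pairs loop is quadratic in the number of clips, so at the scale of datasets such as AudioSet a vectorized or approximate nearest-neighbor search would likely be needed; the sketch only shows the comparison itself.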
Problem

Research questions and friction points this paper is trying to address.

Audio-Language Model Training
Dataset Analysis
Data Bias and Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Language Datasets
Data Diversity Assessment
Bias and Redundancy Evaluation
Gijs Wijngaard
Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University
Elia Formisano
Maastricht Brain Imaging Center, Maastricht University
Auditory Cognitive Neuroscience, Auditory cortex, fMRI, Computational Neuroimaging
Michele Esposito
Assistant Professor of Medicine, Medical University of South Carolina
M. Dumontier
Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University