🤖 AI Summary
This study systematically evaluates 69 audio-language datasets available as of September 2024, revealing pervasive issues including acoustic class imbalance, cross-dataset duplication, linguistic homogeneity (a dominant English bias), restricted accessibility, and latent societal biases. Methodologically, the authors combine PCA-based analysis of audio and text embedding variance across datasets, CLAP-embedding-based detection of data leakage, and joint examination of acoustic and textual category distributions to quantify systemic biases, particularly in widely used sources such as YouTube and Freesound. They also release an open companion repository cataloguing the surveyed datasets and propose a roadmap for audio-language model (ALM) data curation that explicitly balances diversity, robustness, and fairness. This work establishes an empirically grounded, reproducible methodology for dataset development, directly supporting improved generalization in multimodal models.
📝 Abstract
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). It provides a comprehensive analysis of the datasets' origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets such as AudioSet, with over two million samples, and community platforms such as Freesound, with over one million samples. Through principal component analysis of audio and text embeddings, the survey evaluates acoustic and linguistic variability across datasets. It also analyzes data leakage through CLAP embeddings and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting opportunities for improvement.
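The embedding-based leakage analysis mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes each clip has already been encoded into a fixed-dimensional embedding (e.g. by a CLAP model; random vectors stand in here), and flags evaluation items whose nearest training embedding exceeds a cosine-similarity threshold as potential duplicates across datasets. The function names and the 0.95 threshold are illustrative choices, not values from the survey.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def flag_leakage(emb_train: np.ndarray, emb_eval: np.ndarray,
                 threshold: float = 0.95) -> np.ndarray:
    """Return indices of eval items whose nearest training embedding
    exceeds the threshold -- a proxy for cross-dataset duplication."""
    sims = cosine_similarity_matrix(emb_eval, emb_train)
    nearest = sims.max(axis=1)  # best match in the training set per eval item
    return np.flatnonzero(nearest >= threshold)

# Toy stand-ins for CLAP embeddings (real ones would come from an audio encoder).
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 16))
near_dups = train[:3] + rng.normal(scale=1e-3, size=(3, 16))  # leaked clips
clean = rng.normal(size=(20, 16))                              # genuinely new clips
eval_set = np.vstack([near_dups, clean])

leaked = flag_leakage(train, eval_set)
print(leaked)  # the three near-duplicates are flagged
```

The same normalized embeddings feed naturally into the survey's PCA step: stacking them and inspecting the explained variance per dataset gives a rough measure of acoustic or linguistic diversity.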