🤖 AI Summary
This study systematically evaluates 69 audio-language datasets available as of September 2024, revealing pervasive issues including acoustic class imbalance, cross-dataset duplication, linguistic homogeneity (a dominant English bias), restricted accessibility, and latent societal biases. Methodologically, the authors combine PCA-based analysis of audio and text embedding variance across datasets, CLAP-embedding-based detection of data leakage, and joint examination of acoustic and textual category distributions to quantify systemic biases, particularly in widely used sources such as YouTube and Freesound. They also release an open companion repository cataloguing the surveyed datasets and propose a roadmap for audio-language model (ALM) data curation that explicitly balances diversity, robustness, and fairness. This work establishes an empirically grounded, reproducible methodology for dataset development, directly supporting improved generalization in multimodal models.
📝 Abstract
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events and scenes. Advances in dataset creation and computational power have led to significant progress in this domain. This paper surveys 69 datasets used to train ALMs, covering research up to September 2024 (https://github.com/GLJS/audio-datasets). It provides a comprehensive analysis of the datasets' origins, audio and linguistic characteristics, and use cases. Key sources include YouTube-based datasets such as AudioSet, with over two million samples, and community platforms such as Freesound, with over one million samples. Through principal component analysis of audio and text embeddings, the survey evaluates acoustic and linguistic variability across datasets. It also analyzes data leakage through CLAP embeddings and examines sound category distributions to identify imbalances. Finally, the survey identifies key challenges in developing large, diverse datasets to enhance ALM performance, including dataset overlap, biases, accessibility barriers, and the predominance of English-language content, while highlighting opportunities for improvement.
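The embedding-based leakage analysis mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes each clip has already been encoded into a fixed-dimensional embedding (e.g. by a CLAP model; random vectors stand in here), and flags evaluation items whose nearest training embedding exceeds a cosine-similarity threshold as potential duplicates across datasets. The function names and the 0.95 threshold are illustrative choices, not values from the survey.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def flag_leakage(emb_train: np.ndarray, emb_eval: np.ndarray,
                 threshold: float = 0.95) -> np.ndarray:
    """Return indices of eval items whose nearest training embedding
    exceeds the threshold -- a proxy for cross-dataset duplication."""
    sims = cosine_similarity_matrix(emb_eval, emb_train)
    nearest = sims.max(axis=1)  # best match in the training set per eval item
    return np.flatnonzero(nearest >= threshold)

# Toy stand-ins for CLAP embeddings (real ones would come from an audio encoder).
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 16))
near_dups = train[:3] + rng.normal(scale=1e-3, size=(3, 16))  # leaked clips
clean = rng.normal(size=(20, 16))                              # genuinely new clips
eval_set = np.vstack([near_dups, clean])

leaked = flag_leakage(train, eval_set)
print(leaked)  # the three near-duplicates are flagged
```

The same normalized embeddings feed naturally into the survey's PCA step: stacking them and inspecting the explained variance per dataset gives a rough measure of acoustic or linguistic diversity.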