🤖 AI Summary
To address the lack of large-scale, high-quality benchmark datasets for avian acoustic recognition, this paper introduces BirdSet, an openly accessible avian audio benchmark covering nearly 10,000 species. It provides over 6,800 hours of training audio and more than 400 hours of strongly labeled evaluation data. Its key contributions are: (1) about 18× more classes than AudioSet and 17% more total training duration; (2) a unified annotation format across eight heterogeneous evaluation datasets from different sources, supporting use cases such as multi-label classification, covariate shift analysis, and self-supervised learning; and (3) Hugging Face–hosted infrastructure with standardized preprocessing, PyTorch-compatible code, and implementations of multi-label loss functions. The paper benchmarks six well-known deep learning models across three training scenarios, providing reproducible baselines for bioacoustics.
📝 Abstract
Deep learning (DL) has greatly advanced audio classification, yet the field is limited by the scarcity of large-scale benchmark datasets that have propelled progress in other domains. While AudioSet is a pivotal step to bridge this gap as a universal-domain dataset, its restricted accessibility and limited range of evaluation use cases challenge its role as the sole resource. Therefore, we introduce BirdSet, a large-scale benchmark dataset for audio classification focusing on avian bioacoustics. BirdSet surpasses AudioSet with over 6,800 recording hours (↑17%) from nearly 10,000 classes (↑18×) for training and more than 400 hours (↑7×) across eight strongly labeled evaluation datasets. It serves as a versatile resource for use cases such as multi-label classification, covariate shift, or self-supervised learning. We benchmark six well-known DL models in multi-label classification across three distinct training scenarios and outline further evaluation use cases in audio classification. We host our dataset on Hugging Face for easy accessibility and offer an extensive codebase to reproduce our results.
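The benchmark frames bird sound recognition as multi-label classification, because one recording can contain several species at once. The minimal NumPy sketch below shows what that typically involves. It builds multi-hot targets, applies a per-class binary cross-entropy to logits, and scores rankings with a class-wise mean average precision (cmAP). All function names are hypothetical, and this is a generic illustration rather than the BirdSet codebase. The losses and metrics the paper actually uses may differ.

```python
import numpy as np

def multi_hot(label_lists, num_classes):
    """Hypothetical helper: per-recording lists of class indices -> multi-hot matrix."""
    y = np.zeros((len(label_lists), num_classes), dtype=np.float32)
    for i, labels in enumerate(label_lists):
        y[i, labels] = 1.0  # every species present in recording i is a positive
    return y

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (recording, class) pairs.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    i.e. an independent sigmoid per class instead of a softmax over classes.
    """
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(loss))

def average_precision(scores, targets):
    """AP for one class: mean precision at the rank of each positive recording."""
    order = np.argsort(-scores)           # highest score first
    hits = targets[order]
    if hits.sum() == 0:
        return np.nan                     # AP is undefined without positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / hits.sum())

def class_mean_ap(scores, targets):
    """Macro average of per-class AP, skipping classes with no positives."""
    aps = [average_precision(scores[:, c], targets[:, c]) for c in range(targets.shape[1])]
    return float(np.nanmean(aps))

# Toy example: 3 recordings, 3 species; recording 0 contains species 0 and 2.
targets = multi_hot([[0, 2], [1], [0]], num_classes=3)
scores = np.array([[0.9, 0.1, 0.3],
                   [0.2, 0.7, 0.1],
                   [0.6, 0.3, 0.4]])
print(class_mean_ap(scores, targets))  # species 2 is ranked below a negative, so AP < 1
```

The per-class sigmoid makes each species an independent yes/no decision. Macro averaging gives rare species the same weight as common ones, which matters for long-tailed data such as bird recordings.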