🤖 AI Summary
General-purpose Masked Autoencoders (MAEs) struggle to capture domain-specific acoustic features critical for fine-grained avian vocalization classification in bioacoustic monitoring.
Method: We propose Bird-MAE, the first MAE architecture specifically designed for avian acoustics. Leveraging the large-scale BirdSet dataset, we introduce an acoustic-aware MAE framework with optimized masking strategies, domain-adaptive pretraining, and novel frozen-representation utilization. We further devise prototypical probing, a parameter-efficient, highly discriminative transfer method for frozen representations.
Results: Bird-MAE achieves state-of-the-art performance across all BirdSet downstream tasks. In multi-label classification, it significantly outperforms generic Audio-MAE baselines in mean average precision (mAP). Prototypical probing yields up to 37% higher mAP than linear probing and approaches full-parameter fine-tuning performance, with an average gap of only about 3%.
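The "optimized masking strategies" above build on the standard MAE recipe: hide a large fraction of spectrogram patches, encode only the visible ones, and reconstruct the rest. As a rough illustration of that base mechanism (a generic sketch, not the paper's specific strategy), random patch masking can be written as:

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """MAE-style masking: randomly choose which spectrogram patches stay visible."""
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    keep = np.sort(perm[:num_keep])           # indices fed to the encoder
    masked = np.ones(num_patches, dtype=bool)
    masked[keep] = False                      # True = hidden, to be reconstructed
    return keep, masked

# Example: an 8x8 grid of patches with the common 75% mask ratio
keep, masked = random_patch_mask(64, 0.75, np.random.default_rng(0))
```

Because the encoder only ever sees the `keep` subset, a high mask ratio makes pretraining cheap while forcing the model to learn structure that generalizes across the hidden regions.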
📝 Abstract
Masked Autoencoders (MAEs) pretrained on AudioSet fail to capture the fine-grained acoustic characteristics of specialized domains such as bioacoustic monitoring. Bird sound classification is critical for assessing environmental health, yet general-purpose models inadequately address its unique acoustic challenges. To address this, we introduce Bird-MAE, a domain-specialized MAE pretrained on the large-scale BirdSet dataset. We explore adjustments to pretraining, fine-tuning, and the use of frozen representations. Bird-MAE achieves state-of-the-art results across all BirdSet downstream tasks, substantially improving multi-label classification performance compared to the general-purpose Audio-MAE baseline. Additionally, we propose prototypical probing, a parameter-efficient method for leveraging MAEs' frozen representations. Bird-MAE's prototypical probes outperform linear probing by up to 37% in mAP and narrow the gap to fine-tuning to approximately 3% on average on BirdSet.
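Prototypical probing is only described at a high level here. As a hedged illustration of the general idea of probing frozen representations with class prototypes (the function name, the cosine scoring, and the max-pooling are assumptions for this sketch, not the paper's implementation), it might look like:

```python
import numpy as np

def prototypical_probe_logits(tokens, prototypes):
    """Score frozen MAE token embeddings against per-class prototypes.

    tokens:     (T, D) frozen patch embeddings for one clip
    prototypes: (C, D) one learnable prototype vector per class
    Returns (C,) logits: each class's best cosine match over all tokens.
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = t @ p.T               # (T, C) token-to-prototype similarities
    return sim.max(axis=0)      # max-pool over tokens: a local match suffices

# Toy example: force class 1's prototype to match one token exactly
rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
prototypes = rng.standard_normal((3, 8))
prototypes[1] = tokens[5]
logits = prototypical_probe_logits(tokens, prototypes)
```

Only the `(C, D)` prototype matrix is trained, which is why such a probe stays parameter-efficient while remaining far more discriminative than a single linear layer on pooled features: a class can fire on a brief, localized vocalization rather than on the clip-level average.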