AI Summary
To address the limited few-shot sound event detection capabilities of current bioacoustic models, this paper proposes a synthetic-data-driven, query-based Transformer framework. We generate 8,800 hours of strongly labeled audio via domain randomization and assemble a publicly available few-shot bioacoustic benchmark covering 13 diverse tasks. We further design a context-aware, training-free few-shot inference mechanism. Our work pioneers the use of synthetic data for pretraining foundation models in bioacoustics, substantially improving generalization to novel species and unseen recording environments. On few-shot detection benchmarks, our method achieves an average 49% improvement over state-of-the-art approaches. The model is deployed as an open API, enabling plug-and-play adoption by ecologists and behavioral scientists.
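To illustrate the "training-free few-shot inference" idea in general terms: given a handful of labeled support examples of a target sound, a query-by-example detector compares query frames against the support set without any gradient updates. The sketch below shows one common realization of this pattern, prototype matching over frame embeddings; it is illustrative only, and all names and the thresholding scheme are assumptions, not the paper's architecture (which uses a Transformer with context-aware inference).

```python
import numpy as np

def few_shot_detect(query_frames, support_frames, threshold=0.5):
    """Training-free query-by-example detection sketch.

    Averages the support embeddings into a single prototype, then flags
    query frames whose cosine similarity to the prototype exceeds a
    threshold. Hypothetical simplification of few-shot SED inference.
    """
    proto = support_frames.mean(axis=0)
    proto = proto / (np.linalg.norm(proto) + 1e-12)
    q = query_frames / (np.linalg.norm(query_frames, axis=1, keepdims=True) + 1e-12)
    sims = q @ proto  # cosine similarity of each query frame to the prototype
    return sims > threshold  # boolean mask over query frames
```

Because no parameters are updated at inference time, a new species can be detected from a few annotated examples alone, which is what makes this style of inference attractive to field ecologists.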
Abstract
We propose a methodology for training foundation models that enhances their in-context learning capabilities in bioacoustic signal processing. We train on synthetically generated data, introducing a domain-randomization pipeline that constructs diverse acoustic scenes with temporally strong labels. We generate over 8,800 hours of strongly labeled audio and train a query-by-example, transformer-based model to perform few-shot bioacoustic sound event detection. Our second contribution is a public benchmark of 13 diverse few-shot bioacoustic tasks. Our model outperforms previously published methods by 49%, and we demonstrate that this improvement stems from both model design and data scale. We make our trained model available via an API to provide ecologists and ethologists with a training-free tool for bioacoustic sound event detection.
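The core of a domain-randomization pipeline like the one described is to mix isolated event recordings into varied backgrounds at randomized positions and signal-to-noise ratios, so that the exact onset and offset of every event is known by construction (a "temporally strong" label). A minimal sketch of this idea follows; the function name, parameters, and the set of randomized factors are assumptions for illustration, not the paper's actual pipeline, which randomizes far more aspects of the scene.

```python
import numpy as np

def synthesize_scene(background, events, sr=16000, duration_s=10.0,
                     snr_db_range=(-5.0, 20.0), rng=None):
    """Mix randomly placed events into a background at random SNRs.

    Returns the mixture plus (onset_s, offset_s) strong labels, which are
    known exactly because we place the events ourselves. Hypothetical
    simplification of a domain-randomized scene generator.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = int(sr * duration_s)
    scene = np.array(background[:n], dtype=np.float64)  # copy, truncate
    labels = []
    for event in events:
        ev = np.asarray(event, dtype=np.float64)[:n]
        start = int(rng.integers(0, n - len(ev) + 1))  # random placement
        # Scale the event to a randomly drawn SNR relative to the background.
        snr_db = rng.uniform(*snr_db_range)
        bg_rms = np.sqrt(np.mean(scene ** 2)) + 1e-12
        ev_rms = np.sqrt(np.mean(ev ** 2)) + 1e-12
        gain = (bg_rms / ev_rms) * 10 ** (snr_db / 20)
        scene[start:start + len(ev)] += gain * ev
        labels.append((start / sr, (start + len(ev)) / sr))  # strong label
    return scene, labels
```

Scaling such a generator over large pools of source events and backgrounds is how thousands of hours of strongly labeled audio can be produced without manual annotation.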