NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics

📅 2024-11-11
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
📄 PDF
🤖 AI Summary
While large language models (LLMs) have demonstrated strong performance on speech and music tasks, their potential for core bioacoustics applications, such as detecting animal vocalizations in long recordings, identifying rare and endangered species, and annotating context and behavior, remains largely unvalidated. Method: We introduce NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics, enabling transfer of knowledge learned from speech and music data. We propose BEANS-Zero, a zero-shot evaluation benchmark that includes classification of unseen species, and train the model on multi-source text-audio aligned data via supervised fine-tuning with instruction-style prompts. Contribution/Results: NatureLM-audio sets a new state of the art across diverse bioacoustic tasks, notably outperforming prior methods on zero-shot classification of unseen species. The code for generating training and benchmark data and for training the model is publicly released to foster reproducible research.

📝 Abstract
Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior - tasks that are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we test NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets the new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.
Problem

Research questions and friction points this paper is trying to address.

Detecting animal vocalizations in large recordings
Classifying rare and endangered species
Labeling context and behavior in bioacoustics
Innovation

Methods, ideas, or system contributions that make the work stand out.

First audio-language model for bioacoustics tasks
Transfers representations from speech and music to bioacoustics
Zero-shot classification for unseen species
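Zero-shot classification over labels never seen in training is commonly framed as comparing an audio clip's embedding against text embeddings of candidate label names and picking the closest match. The sketch below illustrates that general idea only; it is not NatureLM-audio's actual pipeline (which prompts an LLM with audio), and the `embed` function is a deterministic stand-in for a trained encoder.

```python
import math
import random

def embed(text: str, dim: int = 16) -> list[float]:
    # Stand-in encoder: a deterministic pseudo-random unit vector per input.
    # A real system would use a trained audio or text encoder here.
    rng = random.Random(text)  # seeding with a str is deterministic
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    # Both inputs are unit vectors, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(audio_key: str, candidate_labels: list[str]) -> str:
    """Return the candidate label whose text embedding is closest to the
    (stand-in) audio embedding. Labels need not appear in any training set."""
    audio_vec = embed(audio_key)
    scores = {label: cosine(audio_vec, embed(label)) for label in candidate_labels}
    return max(scores, key=scores.get)
```

Because the candidate label set is supplied at query time, new or unseen species names can be scored without retraining, which is the property the BEANS-Zero benchmark is designed to test.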