A Transformer Model for Segmentation, Classification, and Caller Identification of Marmoset Vocalization

๐Ÿ“… 2024-10-30
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Marmoset vocalization data suffer from high noise levels, sparse annotations, and poor structural organization, hindering joint modeling of vocal segmentation, classification, and caller identification. Method: This work introduces the Transformer architecture to marmoset acoustic analysis for the first time, proposing an end-to-end multi-task learning model that takes spectrograms as input. Leveraging self-attention mechanisms, the model explicitly captures long-range temporal dependencies and cross-vocal-unit relationships, overcoming CNNsโ€™ limitations in global structural modeling. Results: Evaluated on real-world low-resource marmoset recordings, the model jointly optimizes three tasksโ€”vocal segmentation (F1 score โ†‘), call-type classification (accuracy โ†‘), and caller identification (caller ID accuracy โ†‘)โ€”with all metrics significantly outperforming CNN baselines. This study establishes a scalable, transformer-based acoustic analysis paradigm for investigating social communication and language development mechanisms in non-human primates.
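The long-range modeling the summary credits to self-attention can be illustrated with a minimal sketch: scaled dot-product attention over spectrogram time frames, where every frame attends to every other frame regardless of temporal distance. This is a toy illustration, not the paper's model; it omits learned query/key/value projections, multiple heads, and the three task-specific output layers.

```python
import numpy as np

def self_attention(frames):
    """Scaled dot-product self-attention over spectrogram frames.

    frames: (T, d) array of T time frames with d spectral features.
    For illustration, queries, keys, and values are the frames
    themselves (no learned projections). Each output frame is a
    weighted mix of ALL input frames, so dependencies between
    distant vocal units are captured in a single step.
    """
    T, d = frames.shape
    scores = frames @ frames.T / np.sqrt(d)       # (T, T) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over the time axis
    return weights @ frames, weights              # attended frames, attention map

# toy spectrogram: 6 frames, 4 spectral bins
rng = np.random.default_rng(0)
spec = rng.normal(size=(6, 4))
out, attn = self_attention(spec)
```

In a full model, `out` would feed shared encoder layers whose output branches into the three heads (segmentation, call-type classification, caller identification) trained jointly.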

๐Ÿ“ Abstract
The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms in comparison with human infant language development. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work using a CNN achieved a joint model for call segmentation, classification, and caller identification of marmoset vocalizations. However, the CNN has limitations in modeling long-range acoustic patterns. The Transformer architecture, which has been shown to outperform CNNs, uses a self-attention mechanism that efficiently processes information in parallel over long distances and captures the global structure of marmoset vocalization. We propose using the Transformer to jointly segment and classify marmoset calls and identify the caller of each vocalization.
Problem

Research questions and friction points this paper is trying to address.

Segmenting highly variable marmoset vocalizations in noisy conditions
Classifying and identifying marmoset callers with limited annotated data
Modeling long-range temporal dependencies in non-human primate communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Autoencoder for self-supervised pretraining
Transformers with self-attention for global dependencies
Reconstruction of masked segments from unannotated recordings
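The masked-autoencoder idea above can be sketched as a reconstruction objective: hide a subset of spectrogram frames from the encoder and score the model only on the hidden frames, so unannotated recordings supply the supervision. The function below is a hypothetical illustration of that loss, not the paper's implementation.

```python
import numpy as np

def masked_reconstruction_loss(spec, recon, mask):
    """MSE computed only on masked time frames.

    spec:  (T, d) original spectrogram
    recon: (T, d) model reconstruction
    mask:  (T,) boolean, True where a frame was hidden from the encoder.
    The model must infer the hidden frames from the surrounding
    vocal context, which requires no annotations at all.
    """
    diff = (spec[mask] - recon[mask]) ** 2
    return diff.mean()

rng = np.random.default_rng(1)
spec = rng.normal(size=(10, 8))
mask = np.zeros(10, dtype=bool)
mask[[2, 5, 7]] = True                # hide 3 of 10 frames

corrupted = spec.copy()
corrupted[mask] = 0.0                 # a reconstruction that ignores masked frames
loss = masked_reconstruction_loss(spec, corrupted, mask)
perfect = masked_reconstruction_loss(spec, spec, mask)  # 0.0 by construction
```

Because unmasked frames never enter the loss, the encoder cannot simply copy its input; it is pushed to learn the temporal structure of the calls, which is then reused for the three supervised tasks.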