Exploring the Potential of SSL Models for Sound Event Detection

📅 2025-05-17
🤖 AI Summary
This study systematically investigates how well self-supervised learning (SSL) models represent audio for sound event detection (SED) and how their representations can be fused. To address SSL feature selection and ensemble optimization, it proposes a multi-level fusion framework comprising: (i) integration of individual SSL embeddings, (ii) dual-modal fusion (e.g., CRNN+BEATs+WavLM), and (iii) full aggregation. It also introduces normalized sound event bounding boxes (nSEBBs), an adaptive post-processing method that dynamically adjusts predicted event boundaries. On the DCASE 2023 Task 4 benchmark, CRNN+BEATs delivers the best results among individual SSL models, and dual-modal fusion yields complementary performance gains. nSEBBs improves PSDS1 by up to 4% for standalone SSL models.

📝 Abstract
Self-supervised learning (SSL) models offer powerful representations for sound event detection (SED), yet their synergistic potential remains underexplored. This study systematically evaluates state-of-the-art SSL models to guide optimal model selection and integration for SED. We propose a framework that combines heterogeneous SSL representations (e.g., BEATs, HuBERT, WavLM) through three fusion strategies: individual SSL embedding integration, dual-modal fusion, and full aggregation. Experiments on the DCASE 2023 Task 4 Challenge reveal that dual-modal fusion (e.g., CRNN+BEATs+WavLM) achieves complementary performance gains, while CRNN+BEATs alone delivers the best results among individual SSL models. We further introduce normalized sound event bounding boxes (nSEBBs), an adaptive post-processing method that dynamically adjusts event boundary predictions, improving PSDS1 by up to 4% for standalone SSL models. These findings highlight the compatibility and complementarity of SSL architectures, providing guidance for task-specific fusion and robust SED system design.
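The abstract names the fused models but not the fusion mechanics, so the following is only a minimal sketch of one common way to combine heterogeneous frame-level representations: resample each SSL stream to the CNN frame rate, then concatenate per frame before a recurrent layer. The function names, the nearest-neighbour alignment, and the toy dimensions are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence

Frames = List[List[float]]  # one feature vector per time frame


def resample_frames(frames: Sequence[Sequence[float]], target_len: int) -> Frames:
    """Nearest-neighbour resampling of a frame sequence to target_len frames."""
    if not frames or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    src_len = len(frames)
    return [
        list(frames[min(src_len - 1, (i * src_len) // target_len)])
        for i in range(target_len)
    ]


def fuse_embeddings(
    cnn_frames: Sequence[Sequence[float]],
    ssl_streams: Sequence[Sequence[Sequence[float]]],
) -> Frames:
    """Concatenate CNN features with every SSL stream, frame by frame.

    Each SSL stream is first aligned to the CNN frame count (assumed scheme).
    """
    target_len = len(cnn_frames)
    aligned = [resample_frames(stream, target_len) for stream in ssl_streams]
    fused: Frames = []
    for t in range(target_len):
        vec = list(cnn_frames[t])
        for stream in aligned:
            vec.extend(stream[t])
        fused.append(vec)
    return fused


# Toy stand-ins (hypothetical shapes): 4 CNN frames of dim 2,
# a 2-frame "BEATs-like" stream of dim 3, an 8-frame "WavLM-like" stream of dim 1.
cnn = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
beats_like = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
wavlm_like = [[float(i)] for i in range(8)]

fused = fuse_embeddings(cnn, [beats_like, wavlm_like])
print(len(fused), len(fused[0]))  # 4 frames, each of dim 2 + 3 + 1 = 6
```

In this sketch, passing one SSL stream corresponds to a single-embedding setup and passing two corresponds to a dual fusion such as CRNN+BEATs+WavLM; the fused frames would then feed the recurrent part of the CRNN.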
Problem

Research questions and friction points this paper is trying to address.

Evaluating SSL models for optimal sound event detection
Proposing fusion strategies for heterogeneous SSL representations
Introducing nSEBBs to improve event boundary predictions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines heterogeneous SSL representations via fusion strategies
Introduces normalized sound event bounding boxes (nSEBBs)
Systematically evaluates SSL models for optimal SED integration
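This page does not describe how nSEBBs computes or normalizes its boxes, so the sketch below is not that method. It shows only the generic step such post-processing builds on: turning frame-level scores for one sound class into (onset, offset, confidence) segments. The threshold, the hop size, and the choice of the segment maximum as confidence are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, float, float]  # (onset_s, offset_s, confidence)


def scores_to_events(
    scores: Sequence[float], threshold: float = 0.5, hop_s: float = 0.5
) -> List[Event]:
    """Group consecutive frames with score >= threshold into events.

    Confidence is the maximum frame score inside the segment (assumed choice).
    """
    events: List[Event] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # segment opens at this frame
        elif score < threshold and start is not None:
            events.append((start * hop_s, i * hop_s, max(scores[start:i])))
            start = None
    if start is not None:  # segment still open at the end of the clip
        events.append((start * hop_s, len(scores) * hop_s, max(scores[start:])))
    return events


# Toy frame scores for one class (hypothetical values).
frame_scores = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.8]
events = scores_to_events(frame_scores)
print(events)
```

A fixed threshold ties the boundaries to a single operating point; per the abstract, nSEBBs instead adjusts event boundary predictions dynamically.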
Authors
Hanfang Cui, Shanghai Normal University, Shanghai, China
Longfei Song, Shanghai Normal University, Shanghai, China
Li Li, Shanghai Normal University, Shanghai, China
Dongxing Xu, Unisound AI Technology Co., Ltd., Beijing, China
Yanhua Long, Professor, Shanghai Normal University (speech signal processing)