Baseline Systems and Evaluation Metrics for Spatial Semantic Segmentation of Sound Scenes

📅 2025-03-28

🤖 AI Summary

This work addresses the DCASE 2025 Challenge task on spatial semantic segmentation of sound scenes (S5): detecting the sound events in a spatial sound scene and separating each one as a source identified by its class label. It presents baseline systems that combine an audio tagging (AT) model with a label-queried source separation (LSS) model. Two LSS variants built on the ResUNet architecture are studied: one extracts a single source for each detected event, and the other queries multiple sources concurrently. Because conventional metrics assess event recognition and separation quality separately, the authors propose class-aware metrics that evaluate the separated sources and their labels together. Experiments on first-order ambisonics (FOA) spatial audio demonstrate the effectiveness of the baseline systems and confirm the efficacy of the metrics.

📝 Abstract

Immersive communication has made significant advancements, especially with the release of the codec for Immersive Voice and Audio Services. Aiming at its further realization, the DCASE 2025 Challenge has recently introduced a task for spatial semantic segmentation of sound scenes (S5), which focuses on detecting and separating sound events in spatial sound scenes. In this paper, we explore methods for addressing the S5 task. Specifically, we present baseline S5 systems that combine audio tagging (AT) and label-queried source separation (LSS) models. We investigate two LSS approaches based on the ResUNet architecture: a) extracting a single source for each detected event and b) querying multiple sources concurrently. Since each separated source in S5 is identified by its sound event class label, we propose new class-aware metrics to evaluate both the sound sources and labels simultaneously. Experimental results on first-order ambisonics spatial audio demonstrate the effectiveness of the proposed systems and confirm the efficacy of the metrics.
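
The two-stage design described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: `tagger`, `separator`, the `threshold` value, and the `joint` flag are hypothetical names introduced here, and waveforms are plain lists of floats rather than multi-channel FOA signals. The tagger is assumed to return a probability per class, and the separator is assumed to return one waveform per queried label.

```python
from typing import Callable, Dict, List

Waveform = List[float]
# Hypothetical interfaces, assumed for this sketch only.
Tagger = Callable[[Waveform], Dict[str, float]]
Separator = Callable[[Waveform, List[str]], List[Waveform]]


def s5_pipeline(
    mixture: Waveform,
    tagger: Tagger,
    separator: Separator,
    threshold: float = 0.5,  # assumed detection threshold
    joint: bool = False,
) -> Dict[str, Waveform]:
    """Detect event labels, then extract one source per detected label."""
    # Stage 1: audio tagging decides which classes are present.
    probs = tagger(mixture)
    labels = sorted(label for label, p in probs.items() if p >= threshold)
    if not labels:
        return {}

    # Stage 2: label-queried source separation.
    if joint:
        # Variant (b): query all detected labels concurrently.
        sources = separator(mixture, labels)
    else:
        # Variant (a): extract a single source per detected event.
        sources = [separator(mixture, [label])[0] for label in labels]

    return dict(zip(labels, sources))


# Toy stand-ins so the sketch runs end to end.
def toy_tagger(mixture: Waveform) -> Dict[str, float]:
    return {"dog": 0.9, "speech": 0.2, "alarm": 0.6}


def toy_separator(mixture: Waveform, labels: List[str]) -> List[Waveform]:
    # Scales the mixture by the label length; a real model would go here.
    return [[x * len(label) for x in mixture] for label in labels]


result = s5_pipeline([1.0, 2.0], toy_tagger, toy_separator)
```

Either variant returns a mapping from class label to separated source, which is the form a class-aware metric needs in order to score sources and labels together.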

Problem

Research questions and friction points this paper is trying to address.

Detect and separate sound events in spatial scenes

Combine audio tagging and label-queried source separation

Evaluate separated sources with class-aware metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines audio tagging and source separation models

Uses ResUNet for single and multi-source extraction

Introduces class-aware metrics for evaluation
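
To make the idea of a class-aware metric concrete, here is a minimal sketch of one possible form. The summary above does not give the formula, so the scoring rule is an assumption: a class earns its SDR improvement over the mixture only when it appears in both the reference and the estimate, while missed and falsely detected classes contribute zero and still count in the average. The function names `sdr` and `class_aware_sdri` are hypothetical.

```python
import math
from typing import Dict, List

Waveform = List[float]


def sdr(ref: Waveform, est: Waveform, eps: float = 1e-10) -> float:
    """Signal-to-distortion ratio in dB between a reference and an estimate."""
    signal = sum(r * r for r in ref)
    error = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10((signal + eps) / (error + eps))


def class_aware_sdri(
    mixture: Waveform,
    refs: Dict[str, Waveform],
    ests: Dict[str, Waveform],
) -> float:
    """Average SDR improvement over the union of true and predicted classes.

    Assumed rule: classes present on only one side score 0 dB.
    """
    classes = set(refs) | set(ests)
    if not classes:
        return 0.0
    total = 0.0
    for c in classes:
        if c in refs and c in ests:
            # Improvement of the estimate over using the raw mixture.
            total += sdr(refs[c], ests[c]) - sdr(refs[c], mixture)
    return total / len(classes)


mixture = [1.0, 1.0, 1.0, 1.0]
refs = {"dog": [1.0, 0.0, 1.0, 0.0], "alarm": [0.0, 1.0, 0.0, 1.0]}

# Correct labels: both classes are separated and both contribute.
good = {"dog": [1.0, 0.1, 1.0, 0.1], "alarm": [0.1, 1.0, 0.1, 1.0]}
score_good = class_aware_sdri(mixture, refs, good)

# One miss ("alarm") and one false alarm ("speech") dilute the average.
bad = {"dog": [1.0, 0.1, 1.0, 0.1], "speech": [0.5, 0.5, 0.5, 0.5]}
score_bad = class_aware_sdri(mixture, refs, bad)
```

Under this assumed rule, a system that separates well but mislabels its outputs scores lower than one that gets both right, which is the behavior a joint metric is meant to capture.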
