Generating Diverse Audio-Visual 360 Soundscapes for Sound Event Localization and Detection

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper presents SELDVisualSynth, a tool that generates synthetic 360° videos for audio-visual sound event localization and detection (SELD). The generated videos are spatially aligned with synthetic SELD audio and its annotations: objects in the video move in step with the corresponding sound events, and real-world background images are incorporated to improve visual realism. A model trained with the generated data shows gains across multiple metrics, reaching a localization recall of 56.4 and a localization error of 21.9°, which the authors describe as superior and competitive, respectively. The data generation tool is open-sourced for use by the SELD research community.

📝 Abstract
We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve realism in synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates synthetic 360° videos in which objects move in step with the synthetic SELD audio data and its annotations. Experimental results demonstrate that a model trained with this data attains performance gains across multiple metrics, achieving superior localization recall (56.4 LR) and competitive localization error (21.9° LE). We open-source our data generation tool for maximal use by members of the SELD research community.
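The localization error quoted above is an angle in degrees. As a rough illustration of what such a number measures, the sketch below computes the angle between a reference and an estimated direction of arrival. It is not the paper's evaluation code: the function names are hypothetical, and it assumes directions are given as (azimuth, elevation) pairs in degrees.

```python
import math

def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert an (azimuth, elevation) direction in degrees to a 3D unit vector."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )

def angular_error_deg(reference, estimate):
    """Angle in degrees between two directions given as (azimuth, elevation)."""
    u = doa_to_unit_vector(*reference)
    v = doa_to_unit_vector(*estimate)
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))

if __name__ == "__main__":
    # Hypothetical reference and estimated directions, for illustration only.
    print(angular_error_deg((30.0, 10.0), (45.0, 5.0)))
```

Averaging this angle over matched reference and predicted events would give one plausible form of a mean localization error; how events are matched and which classes are counted are details the summary does not specify.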
Problem

Research questions and friction points this paper is trying to address.

Generating diverse audio-visual 360° soundscapes for SELD
Ensuring spatial alignment in synthetic audio-visual data
Improving model performance with realistic synthetic training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic 360° videos with moving objects
Real-world background images enhance realism
Audio-visual spatial alignment ensured
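
One way to picture audio-visual spatial alignment is as a mapping from a sound event's annotated direction to the pixel where its object should be drawn in a 360° frame. The sketch below assumes an equirectangular frame whose center corresponds to azimuth 0° and elevation 0°, with positive azimuth to the left and positive elevation upward. These conventions and the function names are assumptions for illustration, not details taken from SELDVisualSynth.

```python
import math

def doa_to_equirect_pixel(azimuth_deg, elevation_deg, width, height):
    """Map a direction (degrees) to a (column, row) pixel in an equirectangular frame.

    Assumed convention: azimuth 0 / elevation 0 is the frame center,
    positive azimuth moves left, positive elevation moves up.
    """
    x = (0.5 - azimuth_deg / 360.0) * width
    y = (0.5 - elevation_deg / 180.0) * height
    col = math.floor(x) % width                       # wrap around horizontally
    row = min(max(math.floor(y), 0), height - 1)      # clamp at the poles
    return col, row

def trajectory_to_pixels(trajectory, width, height):
    """Convert per-frame (azimuth, elevation) annotations to pixel positions."""
    return [doa_to_equirect_pixel(az, el, width, height) for az, el in trajectory]

if __name__ == "__main__":
    # Hypothetical per-frame directions of one moving source.
    frames = [(0.0, 0.0), (45.0, 10.0), (90.0, 20.0)]
    print(trajectory_to_pixels(frames, width=1920, height=960))
```

Drawing the object at these positions frame by frame keeps the visual track consistent with the audio annotations under the stated conventions; a different azimuth sign or frame origin would only change the two linear formulas.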