🤖 AI Summary
To address the scarcity of high-quality, 360° audio-visual paired data for sound event localization and detection (SELD), this paper proposes SELDVisualSynth, a synthetic data generation framework that ensures audio-visual spatial alignment. Methodologically, it renders 360° videos in which object motion matches synthetic SELD audio and its spatial annotations, and incorporates real background images to improve visual realism; spatially aligned SELD ground-truth labels are produced automatically as part of the process. Experimental results demonstrate that a model trained on SELDVisualSynth data attains gains across multiple metrics, achieving 56.4% localization recall and a competitive 21.9° localization error. The tool is open-sourced for use by the SELD research community.
📝 Abstract
We present SELDVisualSynth, a tool for generating synthetic videos for audio-visual sound event localization and detection (SELD). Our approach incorporates real-world background images to improve realism in synthetic audio-visual SELD data while also ensuring audio-visual spatial alignment. The tool creates 360° synthetic videos in which object motion matches the synthetic SELD audio data and its annotations. Experimental results demonstrate that a model trained with this data attains performance gains across multiple metrics, achieving superior localization recall (56.4% LR) and competitive localization error (21.9° LE). We open-source our data generation tool for maximal use by members of the SELD research community.
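The core of the audio-visual alignment described above is mapping each SELD annotation's direction of arrival (azimuth/elevation) to a pixel position in an equirectangular 360° frame, so that a rendered object overlay tracks the synthetic sound source. The sketch below illustrates this idea only; the function name and angle conventions are assumptions for illustration and are not taken from the released tool's API.

```python
# Conceptual sketch (not the released tool's actual API): map a SELD
# annotation's direction of arrival, given as azimuth/elevation in degrees,
# to a pixel position in an equirectangular 360-degree video frame, so a
# rendered object stays spatially aligned with the synthetic audio source.

def doa_to_equirect_pixel(azimuth_deg: float, elevation_deg: float,
                          frame_width: int, frame_height: int) -> tuple[int, int]:
    """Map azimuth in [-180, 180] and elevation in [-90, 90] to (x, y) pixels.

    Assumes azimuth 0 at the frame centre (increasing to the left) and
    elevation 0 on the horizon (increasing upward), a common SELD convention;
    the actual tool may use a different convention.
    """
    x = (0.5 - azimuth_deg / 360.0) * frame_width
    y = (0.5 - elevation_deg / 180.0) * frame_height
    # Wrap horizontally (the 360-degree seam) and clamp vertically.
    return int(round(x)) % frame_width, min(max(int(round(y)), 0), frame_height - 1)


if __name__ == "__main__":
    # A source annotated at azimuth 90 deg (to the left) and elevation 30 deg
    # (above the horizon), placed in a 1920x960 equirectangular frame.
    print(doa_to_equirect_pixel(90.0, 30.0, 1920, 960))  # -> (480, 320)
```

Evaluating this mapping per annotation frame yields a pixel trajectory for each sound event, which is what lets the generated video and the SELD labels stay spatially consistent.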