SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the vulnerability of target source extraction in Ambisonic mixed soundfields to near-field interference, this paper proposes SoundSculpt: a spatial-semantic jointly conditioned Ambisonic-in-Ambisonic-out neural network. Methodologically, it introduces the first integration of user-specified spatial direction with multimodal semantic cues—namely, CLIP image embeddings, image segmentation masks, and caption text encodings—via a dual-path semantic encoder and a spatial-conditioned encoder. Trained jointly on synthetic and real Ambisonic data, SoundSculpt significantly improves extraction robustness, especially under spatially overlapping interference: it achieves a 2.1 dB SDR gain over single-condition baselines and consistently outperforms conventional signal processing methods. The core contribution lies in the novel co-driven paradigm wherein spatial priors and heterogeneous semantic embeddings are synergistically leveraged, establishing a new framework for target soundfield separation in immersive audio.

Technology Category

Application Category

📝 Abstract
This paper introduces SoundSculpt, a neural network designed to extract target sound fields from ambisonic recordings. SoundSculpt employs an ambisonic-in-ambisonic-out architecture and is conditioned on both spatial information (e.g., target direction obtained by pointing at an immersive video) and semantic embeddings (e.g., derived from image segmentation and captioning). Trained and evaluated on synthetic and real ambisonic mixtures, SoundSculpt demonstrates superior performance compared to various signal processing baselines. Our results further reveal that while spatial conditioning alone can be effective, the combination of spatial and semantic information is beneficial in scenarios where there are secondary sound sources spatially close to the target. Additionally, we compare two different semantic embeddings derived from a text description of the target sound using text encoders.
Problem

Research questions and friction points this paper is trying to address.

Extracts target sound fields from ambisonic recordings
Uses spatial and semantic information for sound extraction
Improves performance when secondary sources are near target
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ambisonic-in-ambisonic-out neural network architecture
Spatial and semantic conditioning for sound extraction
Combines target direction and text-derived embeddings
🔎 Similar Papers
No similar papers found.