🤖 AI Summary
Existing audio-language models take single-channel input, which fundamentally limits their spatial understanding. Sci-Phi is a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for every sound source and for the surrounding environment: sound type, timing, direction, distance, loudness, and reverberation. Trained on over 4,000 hours of synthetic first-order Ambisonics recordings with metadata, it enumerates and describes up to four directional sound sources in a single pass, along with non-directional background sounds and room characteristics. The model is evaluated with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation. Robustness is analyzed across source counts, signal-to-noise ratios, reverberation levels, and mixtures of acoustically, spatially, or temporally similar sources, and the model generalizes to real room impulse responses with only minor performance degradation. The authors present it as the first audio LLM capable of full spatial-scene description.
📝 Abstract
Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo
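A permutation-invariant protocol is needed because a model may list the sources in any order, so predictions have to be matched to reference sources before per-source errors are computed. The sketch below illustrates the general idea only; it is not the authors' protocol. It assumes each source is reduced to a single azimuth in degrees, that predicted and reference source counts are equal, and that the matching cost is the summed angular error. The names `angular_error_deg` and `match_sources` are hypothetical.

```python
from itertools import permutations
import math


def angular_error_deg(az_a, az_b):
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = abs(az_a - az_b) % 360.0
    return min(diff, 360.0 - diff)


def match_sources(pred_az, ref_az):
    """Find the ordering of predictions that minimizes total angular error.

    Returns (perm, mean_error), where perm[i] is the index of the
    prediction matched to reference source i.
    Assumption for this sketch: both lists have the same length.
    """
    if len(pred_az) != len(ref_az):
        raise ValueError("this sketch assumes equal source counts")

    best_perm, best_cost = None, math.inf
    # With at most four sources there are at most 4! = 24 orderings,
    # so an exhaustive search is cheap.
    for perm in permutations(range(len(pred_az))):
        cost = sum(
            angular_error_deg(pred_az[p], ref_az[i])
            for i, p in enumerate(perm)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost

    mean_error = best_cost / len(ref_az) if ref_az else 0.0
    return best_perm, mean_error


if __name__ == "__main__":
    predicted = [350.0, 90.0, 180.0]   # model output, arbitrary order
    reference = [92.0, 178.0, 355.0]   # ground-truth metadata
    perm, mean_error = match_sources(predicted, reference)
    print(perm, mean_error)
```

Once the best ordering is fixed, the same assignment can be reused for the other per-source metrics, such as distance or loudness error. For larger source counts, an assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the exhaustive search.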