Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio language models commonly overlook spatial audio cues, specifically sound source localization and scene geometry. To bridge this gap, the authors propose Spatial-Omni, the first approach to incorporate First-Order Ambisonics (FOA) spatial audio as a distinct modality within a multimodal large language model. They introduce a lightweight SO-Encoder that generates spatial tokens at limited additional context cost, and they adopt a staged training strategy to improve spatial understanding. The training and evaluation resources (SO-Dataset, SO-QA, and SO-Bench) are built from open-source data, real recordings, and simulations; SO-Bench covers 16 spatial audio understanding subtasks. Experiments show that Spatial-Omni outperforms current open-source audio language models and Omni baselines on spatial audio understanding while retaining a reasonable level of general audio understanding.
📝 Abstract
Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues that spatial audio provides for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that uses SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLMs on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.
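To make concrete what a monaural pipeline discards, the following minimal sketch encodes a mono tone as a single plane-wave source in FOA and then recovers its direction from the four channels. It uses the standard first-order Ambisonics panning gains (ACN channel order W, Y, Z, X with SN3D normalization) and a textbook intensity-vector estimate. It is not the SO-Encoder or any part of the Spatial-Omni code; the function names `encode_foa` and `estimate_direction` are hypothetical and chosen only for illustration.

```python
import math

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Pan a mono signal to FOA as one plane wave.

    Assumed convention: ACN order (W, Y, Z, X), SN3D gains,
    azimuth counterclockwise from the front, elevation upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gx = math.cos(az) * math.cos(el)  # front-back component
    gy = math.sin(az) * math.cos(el)  # left-right component
    gz = math.sin(el)                 # up-down component
    w = list(signal)                  # omnidirectional channel: the mono signal itself
    y = [gy * s for s in signal]
    z = [gz * s for s in signal]
    x = [gx * s for s in signal]
    return w, y, z, x

def estimate_direction(w, y, z, x):
    """Estimate (azimuth, elevation) in degrees from the time-averaged
    product of W with each directional channel (an intensity-vector estimate)."""
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation

# A 440 Hz tone sampled at 16 kHz, placed 60 degrees to the left and 20 degrees up.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
w, y, z, x = encode_foa(tone, azimuth_deg=60.0, elevation_deg=20.0)
az, el = estimate_direction(w, y, z, x)
print(f"estimated azimuth={az:.1f} deg, elevation={el:.1f} deg")
```

In this sketch the W channel is identical to the mono tone, so a model that consumes only W (or a mono downmix) receives no information about where the source is; the direction is carried entirely by the relative gains of the Y, Z, and X channels. Real recordings with several sources, reverberation, and noise are much harder than this single-source case.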
Problem

Research questions and friction points this paper is trying to address.

spatial audio
sound localization
spatial relation reasoning
multimodal LLMs
spatial scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Audio Understanding
First-Order Ambisonics (FOA)
Multimodal LLMs
SO-Encoder
Spatial Reasoning