Probing Spatial Structure in Pretrained Audio Representations

📅 2026-06-03

🤖 AI Summary
This work addresses the lack of systematic evaluation of how well pretrained audio representations encode the spatial attributes of sound sources and room environments. It introduces SARL (Spatial Audio Representation Learning), an open-source benchmark that uses probing tasks and sensitivity analysis under controlled perturbations to measure how much information pretrained models encode about source-related factors (azimuth, elevation, distance, and class) and room-related factors (RT60 reverberation time, volume, and shape). The study reports three findings: input configuration and training paradigm shape spatial encoding; source-related factors are consistently easier to decode than room-related factors; and models respond heterogeneously to controlled source and room variation. Together, these results point to systematic biases in current pretrained audio representations.
📝 Abstract
Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.
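To make the idea of probing concrete, here is a minimal sketch of a linear probe for azimuth on frozen embeddings. It is not the SARL implementation: the abstract does not specify the probe architecture, targets, or metrics, so every name below (`make_synthetic_embeddings`, `fit_ridge_probe`, and the rest) is hypothetical, and the embeddings are synthetic stand-ins rather than outputs of a real pretrained encoder. The sketch assumes a ridge-regression probe that predicts the sine and cosine of azimuth, which avoids the wrap-around at ±180°, plus a shuffled-label control.

```python
import numpy as np


def make_synthetic_embeddings(n, dim=32, noise=0.1, seed=0):
    """Stand-in for a frozen encoder (assumption): embeddings that
    linearly mix (sin az, cos az) and add Gaussian noise."""
    rng = np.random.default_rng(seed)
    azimuth = rng.uniform(-np.pi, np.pi, size=n)
    circ = np.stack([np.sin(azimuth), np.cos(azimuth)], axis=1)  # (n, 2)
    mixing = rng.normal(size=(2, dim))
    emb = circ @ mixing + noise * rng.normal(size=(n, dim))
    return emb, azimuth


def add_bias(x):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


def fit_ridge_probe(x, y, l2=1e-3):
    """Closed-form ridge regression; the encoder is never updated.
    For simplicity the bias column is regularized too."""
    x1 = add_bias(x)
    gram = x1.T @ x1 + l2 * np.eye(x1.shape[1])
    return np.linalg.solve(gram, x1.T @ y)


def predict_azimuth(weights, x):
    """Map predicted (sin, cos) pairs back to an angle in radians."""
    pred = add_bias(x) @ weights
    return np.arctan2(pred[:, 0], pred[:, 1])


def mean_angular_error_deg(pred, true):
    """Mean absolute angular difference, wrapped to [-180, 180] degrees."""
    diff = np.arctan2(np.sin(pred - true), np.cos(pred - true))
    return float(np.degrees(np.abs(diff)).mean())


emb, az = make_synthetic_embeddings(2000)
targets = np.stack([np.sin(az), np.cos(az)], axis=1)
train, test = slice(0, 1500), slice(1500, 2000)

# Probe trained on the true azimuth labels.
w = fit_ridge_probe(emb[train], targets[train])
probe_error = mean_angular_error_deg(predict_azimuth(w, emb[test]), az[test])

# Control: same probe, but with training labels shuffled across examples.
perm = np.random.default_rng(1).permutation(1500)
w_ctrl = fit_ridge_probe(emb[train], targets[train][perm])
control_error = mean_angular_error_deg(predict_azimuth(w_ctrl, emb[test]), az[test])

print(f"probe error:   {probe_error:.2f} deg")
print(f"control error: {control_error:.2f} deg")
```

A low probe error next to a much higher control error would suggest that azimuth is linearly decodable from the representation rather than memorized by the probe. With a real encoder, `make_synthetic_embeddings` would be replaced by embeddings extracted from spatialized audio, and the same pattern extends to the other factors the benchmark names (elevation, distance, class, RT60, volume, shape) by swapping the target and the metric.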
Problem

Research questions and friction points this paper is trying to address.

spatial audio
pretrained representations
audio encoders
spatial encoding
representation evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial audio
representation learning
benchmarking
pretrained audio models
controlled probing