Probing Spatial Structure in Pretrained Audio Representations

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic evaluation of how well pretrained audio representations encode the spatial attributes of sound sources and room environments. It introduces SARL (Spatial Audio Representation Learning), an open-source benchmark that uses probing tasks and sensitivity analyses under controlled perturbations to measure how much information pretrained models carry about source-related factors (azimuth, elevation, distance, and category) and room-related factors (reverberation time, volume, and shape). The study reports three findings: input configuration and training paradigm shape spatial encoding; source-related information is consistently easier to decode than room-related information; and models respond heterogeneously to controlled source and room variations. Together, these findings point to systematic biases in current audio representations.

📝 Abstract

Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.
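To make the idea of a probing task concrete, the following is a minimal sketch of a linear probe that decodes source azimuth from frozen embeddings. It is not SARL's actual procedure: the embeddings are synthetic stand-ins for encoder outputs, and the ridge-regression probe, the (sin, cos) target encoding, and the names `fit_ridge_probe` and `angular_error_deg` are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge_probe(X, Y, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


def angular_error_deg(pred_sin_cos, true_az):
    """Mean absolute angular error in degrees, with wrap-around at 360."""
    pred_az = np.arctan2(pred_sin_cos[:, 0], pred_sin_cos[:, 1])
    diff = np.angle(np.exp(1j * (pred_az - true_az)))  # wrap to (-pi, pi]
    return float(np.degrees(np.abs(diff)).mean())


rng = np.random.default_rng(0)
n, d, n_train = 2000, 64, 1500

# Ground-truth azimuth, encoded as (sin, cos) so the target is continuous
# across the +/-180 degree boundary.
az = rng.uniform(-np.pi, np.pi, size=n)
targets = np.stack([np.sin(az), np.cos(az)], axis=1)

# ASSUMPTION: synthetic "frozen embeddings" that linearly mix the azimuth
# signal with noise. In practice these would come from a pretrained encoder.
mix = rng.normal(size=(2, d))
emb = targets @ mix + 0.1 * rng.normal(size=(n, d))
emb = np.hstack([emb, np.ones((n, 1))])  # bias column

# Fit the probe on the training split only; the encoder is never updated.
W = fit_ridge_probe(emb[:n_train], targets[:n_train])
probe_err = angular_error_deg(emb[n_train:] @ W, az[n_train:])

# Control: the same probe trained on shuffled labels. A large gap between
# the two errors suggests the embeddings, not the probe, carry the signal.
perm = rng.permutation(n_train)
W_ctrl = fit_ridge_probe(emb[:n_train], targets[:n_train][perm])
ctrl_err = angular_error_deg(emb[n_train:] @ W_ctrl, az[n_train:])

print(f"probe error: {probe_err:.2f} deg, shuffled-label control: {ctrl_err:.2f} deg")
```

A low probe error relative to the shuffled-label control indicates that azimuth is linearly decodable from the representation. The same pattern would apply to other factors, with a classification probe for categorical targets such as source class or room shape.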

Problem

Research questions and friction points this paper is trying to address.

spatial audio

pretrained representations

audio encoders

spatial encoding

representation evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial audio

representation learning

benchmarking