🤖 AI Summary
How existing evaluation metrics for generative spatial audio respond to variations in spatial parameters such as azimuth and elevation has not been systematically investigated. This work proposes a sensitivity analysis framework that evaluates multiple metrics along continuous spatial trajectories in First-Order Ambisonics (FOA) scenes and introduces three criteria for metric behavior: responsiveness, smoothness, and symmetry. The framework is applied to Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps in controlled scenes of increasing complexity. Results show that FAD computed on localization-specific embeddings and acoustic maps are highly responsive and remain smooth and symmetric across conditions, whereas intensity vectors degrade as scene complexity increases. These findings indicate that existing metrics differ in their sensitivity to spatial variations, which can inform the design and selection of evaluation methods for spatial audio generation.
📝 Abstract
Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spatial trajectories, drawing on principles of sensitivity analysis in parametric sound synthesis. We define three desiderata for metric behavior, Responsiveness, Smoothness, and Symmetry, and examine them using controlled FOA scenes of increasing complexity. We assess standard distribution-based and sample-based metrics, including Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps. Our findings show that FAD using localization-specific embeddings and acoustic maps yield high Responsiveness and robust Smoothness and Symmetry across conditions, while intensity vectors degrade with increasing scene complexity. This is a first step towards investigating the sensitivity of metrics for generative spatial audio.
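To make the idea of a trajectory sweep concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a single noise source encoded with the standard first-order ambisonic gains (ACN channel order, SN3D normalization), uses the angular error between time-averaged intensity-vector directions as the metric, and sweeps azimuth from -90° to 90° at zero elevation. The three summary numbers (`responsiveness`, `roughness`, `asymmetry`) and all function names are hypothetical stand-ins chosen for illustration; the abstract does not give the paper's formal definitions of Responsiveness, Smoothness, and Symmetry.

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as FOA (ACN order W, Y, Z, X; SN3D gains). Angles in radians."""
    w = signal
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    x = signal * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])

def intensity_direction(foa):
    """Unit vector (x, y, z) of the time-averaged pseudo-intensity vector."""
    w, y, z, x = foa
    vec = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    return vec / np.linalg.norm(vec)

def angular_error(ref_foa, test_foa):
    """Angle in radians between the directions estimated from two FOA scenes."""
    cos = np.dot(intensity_direction(ref_foa), intensity_direction(test_foa))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def sweep(azimuths_deg, seed=0):
    """Metric value at each azimuth, relative to a reference source at azimuth 0."""
    rng = np.random.default_rng(seed)
    source = rng.standard_normal(4800)
    ref = encode_foa(source, 0.0, 0.0)
    return np.array([
        angular_error(ref, encode_foa(source, np.deg2rad(a), 0.0))
        for a in azimuths_deg
    ])

def summarize(values):
    """Illustrative summaries (assumed definitions, not the paper's).

    Assumes `values` is sampled on a grid that is symmetric around the reference.
    """
    responsiveness = float(values.max() - values.min())       # range covered by the metric
    roughness = float(np.max(np.abs(np.diff(values, n=2))))   # largest second difference
    asymmetry = float(np.max(np.abs(values - values[::-1])))  # mismatch between +az and -az
    return responsiveness, roughness, asymmetry

azimuths = np.linspace(-90, 90, 37)  # 5-degree steps, symmetric around 0
curve = sweep(azimuths)
responsiveness, roughness, asymmetry = summarize(curve)
print(f"responsiveness={responsiveness:.3f} roughness={roughness:.3f} asymmetry={asymmetry:.2e}")
```

In this single-source case the intensity direction recovers the source direction exactly, so the curve is simply the absolute azimuth offset. Scenes with several simultaneous sources, which the abstract describes as increased scene complexity, would require summing the encoded sources before computing the metric.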