🤖 AI Summary
Existing global metrics struggle to diagnose failure modes in binaural audio novel-view synthesis models. This work proposes the first full-reference, model-agnostic diagnostic framework that enables time–frequency visualization of multidimensional errors—including magnitude, interaural level difference (ILD), interaural phase difference (IPD), temporal misalignment, loudness discrepancies, and high-frequency distortions—through an interpretable 3D audio error map (3DAE Map). The authors further establish a unified evaluation benchmark, 3DAE Bench. Experiments on the Replay-NVAS and SoundSpaces datasets reveal distinct dominant failure modes: temporal misalignment predominates in Replay-NVAS, whereas ILD mismatch is primary in SoundSpaces, offering fine-grained guidance for targeted model improvement.
📝 Abstract
3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why a binaural prediction fails. We propose a full-reference diagnostic framework that uses time-frequency audio error maps for magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures, forming a 3D Audio Error Map (3DAE Map) for visual inspection. We package these diagnostics into a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary pairs of ground-truth and predicted binaural audio and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces show different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps to guide the development and optimization of audio novel-view synthesis models.
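To make the idea of a full-reference time-frequency error map concrete, here is a minimal sketch of three of the listed maps (magnitude, ILD, IPD) computed from a ground-truth and a predicted binaural pair. The abstract does not give the exact definitions, so everything here is an assumption based on common conventions: the function names `stft` and `error_maps`, the Hann-windowed STFT with `n_fft=512` and `hop=128`, ILD as a left-minus-right log-magnitude difference in dB, and IPD as the phase of the left-right cross-spectrum. The temporal alignment, loudness, and high-frequency diagnostics are omitted.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def error_maps(ref, pred, n_fft=512, hop=128, eps=1e-8):
    """Illustrative error maps (not the paper's exact definitions).

    ref, pred: arrays of shape (2, samples); channel 0 is left, channel 1 is right.
    Returns a dict of (frames, bins) maps: magnitude and ILD errors in dB, IPD error in radians.
    """
    L_ref, R_ref = stft(ref[0], n_fft, hop), stft(ref[1], n_fft, hop)
    L_pred, R_pred = stft(pred[0], n_fft, hop), stft(pred[1], n_fft, hop)

    def log_mag(spec):
        return 20.0 * np.log10(np.abs(spec) + eps)

    # Magnitude error: absolute log-magnitude difference, averaged over both ears.
    mag_err = 0.5 * (np.abs(log_mag(L_pred) - log_mag(L_ref))
                     + np.abs(log_mag(R_pred) - log_mag(R_ref)))

    # ILD error: difference between predicted and reference left/right level differences.
    ild_ref = log_mag(L_ref) - log_mag(R_ref)
    ild_pred = log_mag(L_pred) - log_mag(R_pred)
    ild_err = np.abs(ild_pred - ild_ref)

    # IPD error: difference of interaural phases, wrapped to [-pi, pi] before taking |.|.
    ipd_ref = np.angle(L_ref * np.conj(R_ref))
    ipd_pred = np.angle(L_pred * np.conj(R_pred))
    ipd_err = np.abs(np.angle(np.exp(1j * (ipd_pred - ipd_ref))))

    return {"magnitude": mag_err, "ild": ild_err, "ipd": ipd_err}

# Synthetic check: halve the right channel of the "prediction" only.
rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 16000))
pred = ref.copy()
pred[1] *= 0.5

maps = error_maps(ref, pred)
for name, m in maps.items():
    print(name, m.shape, float(np.median(m)))
```

In this synthetic case the prediction differs from the reference only by a factor of 0.5 on the right ear, so the ILD map should sit near 20·log10(2) ≈ 6 dB everywhere while the IPD map stays essentially zero. That separation is the kind of failure-mode distinction a single global metric would blur. Each returned map can be passed to any heatmap plotter with time on one axis and frequency on the other.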