AI Summary
To address low spatial upsampling accuracy and difficult personalization when HRTFs are measured at only a few directions (3 to 5), this paper proposes the retrieval-augmented neural field (RANF). RANF follows a "retrieve-fuse" paradigm: it first retrieves, from an HRTF dataset, subjects whose HRTFs are close to those of the target subject, then feeds their HRTFs at the desired direction into the neural field together with the sound source direction. Multiple retrieved subjects are fused by a network inspired by transform-average-concatenate, a multi-channel processing technique. Built on neural field modeling, RANF combines this retrieval with fine-tuning on the SONICOM dataset to achieve spatial upsampling and subject-specific adaptation from minimal samples. Experiments on SONICOM confirm that RANF improves upsampling accuracy, and it is the core component of the winning solution to Task 2 of the 2024 Listener Acoustic Personalization Challenge. It provides a framework for modeling HRTFs from sparse spatial measurements.
Abstract
Head-related transfer functions (HRTFs) with dense spatial grids are desired for immersive binaural audio generation, but their recording is time-consuming. Although HRTF spatial upsampling has shown remarkable progress with neural fields, spatial upsampling only from a few measured directions, e.g., 3 or 5 measurements, is still challenging. To tackle this problem, we propose a retrieval-augmented neural field (RANF). RANF retrieves a subject whose HRTFs are close to those of the target subject from a dataset. The HRTF of the retrieved subject at the desired direction is fed into the neural field in addition to the sound source direction itself. Furthermore, we present a neural network that can efficiently handle multiple retrieved subjects, inspired by a multi-channel processing technique called transform-average-concatenate. Our experiments confirm the benefits of RANF on the SONICOM dataset, and it is a key component in the winning solution of Task 2 of the listener acoustic personalization challenge 2024.
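The retrieval and fusion steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that closeness between subjects is measured by the mean log-spectral distance over the measured directions, and that HRTFs are stored as magnitude spectra keyed by (azimuth, elevation). The names `log_spectral_distance`, `retrieve_subjects`, and `tac_fuse` are hypothetical, and the learned transforms of a real transform-average-concatenate layer are omitted.

```python
import math

def log_spectral_distance(h_a, h_b):
    """RMS difference in dB between two magnitude spectra of equal length."""
    sq = [(20.0 * math.log10(a / b)) ** 2 for a, b in zip(h_a, h_b)]
    return math.sqrt(sum(sq) / len(sq))

def retrieve_subjects(target, database, k=1):
    """Return the ids of the k database subjects closest to the target.

    target:   {direction: spectrum} for the few measured directions
    database: {subject_id: {direction: spectrum}} on a denser grid
    Assumption: closeness is the mean log-spectral distance over the
    directions measured for the target.
    """
    def mean_lsd(subject_hrtfs):
        dists = [log_spectral_distance(target[d], subject_hrtfs[d]) for d in target]
        return sum(dists) / len(dists)
    return sorted(database, key=lambda sid: mean_lsd(database[sid]))[:k]

def tac_fuse(features):
    """Average per-subject features and concatenate the average to each one.

    A real transform-average-concatenate layer applies learned transforms
    before and after averaging; they are left out here (assumption).
    """
    n, dim = len(features), len(features[0])
    avg = [sum(f[i] for f in features) / n for i in range(dim)]
    return [list(f) + avg for f in features]

# Toy data: the target is measured at 3 directions; the database is denser.
measured = [(0, 0), (90, 0), (180, 0)]
query = (270, 0)  # direction to upsample to
base = {(0, 0): [1.0, 2.0], (90, 0): [2.0, 1.0],
        (180, 0): [1.0, 1.0], (270, 0): [1.5, 0.5]}
target = {d: base[d] for d in measured}
database = {
    sid: {d: [g * v for v in spec] for d, spec in base.items()}
    for sid, g in [("s1", 1.1), ("s2", 2.0), ("s3", 3.0)]
}

retrieved = retrieve_subjects(target, database, k=2)
# HRTFs of the retrieved subjects at the query direction; in RANF these are
# fed to the neural field together with the direction itself.
conditioning = tac_fuse([database[sid][query] for sid in retrieved])
```

Because the averaged vector does not depend on the order of the retrieved subjects, the fused features are the same set regardless of retrieval order, which is what lets a single network handle several retrieved subjects.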