🤖 AI Summary
This work addresses the limited accuracy of speaker distance estimation (SDE) caused by the scarcity of real room impulse response (RIR) data. The authors propose a generative data augmentation approach that uses FastRIR to synthesize RIRs conditioned only on speaker and listener positions. A quality filter keeps the generated RIRs that align with the challenge RIRs, and the augmented dataset is then used to fine-tune a deep SDE model with hyperparameter optimization. The method reduces the mean absolute error (MAE) across the five positions from 1.66 m to 0.60 m in GWA rooms and from 2.18 m to 0.69 m in Treble rooms, with the largest gains at medium to long distances.
📝 Abstract
The Room Acoustics and Speaker Distance Estimation (SDE) Challenge at ICASSP 2025 explores the effectiveness of augmented room impulse response (RIR) data for improving SDE model performance. The challenge, held as part of GenDARA, involves generating RIRs to supplement sparse datasets and fine-tuning SDE models with the augmented data. We employ the open-source fast diffuse room impulse response generator (FastRIR), conditioned only on speaker and listener locations. We design a quality filter to ensure that the generated RIRs align with the challenge RIRs, and we apply hyperparameter optimization during model fine-tuning. Our approach reduces the mean absolute error (MAE) across the five positions from 1.66 m to 0.6 m for GWA rooms and from 2.18 m to 0.69 m for Treble rooms. The results show that the augmentation approach significantly improves estimation accuracy, particularly at medium to long distances.
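For readers unfamiliar with the reported metric, the following is a minimal sketch of how an MAE over a handful of positions can be computed. The function name `mean_absolute_error_m` and the five distance values are hypothetical and chosen only for illustration; they are not the challenge data or the authors' evaluation code.

```python
import numpy as np


def mean_absolute_error_m(true_m, pred_m):
    """Return the mean absolute error, in metres, between two distance lists."""
    true_arr = np.asarray(true_m, dtype=float)
    pred_arr = np.asarray(pred_m, dtype=float)
    if true_arr.shape != pred_arr.shape:
        raise ValueError("true and predicted distances must have the same shape")
    # Average the per-position absolute errors.
    return float(np.mean(np.abs(pred_arr - true_arr)))


# Hypothetical ground-truth and predicted distances for five positions (metres).
true_distances_m = [1.0, 2.5, 4.0, 6.0, 8.0]
predicted_distances_m = [1.2, 2.0, 4.5, 5.0, 9.0]

mae = mean_absolute_error_m(true_distances_m, predicted_distances_m)
print(f"MAE over {len(true_distances_m)} positions: {mae:.2f} m")
```

In this made-up example the two longest distances contribute the largest absolute errors, which illustrates why improvements at medium to long range can dominate the overall MAE.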