🤖 AI Summary
Mean opinion score (MOS) predictors for speech naturalness generalize poorly when the input speech comes at multiple sampling frequencies (SFs). This paper proposes an SF-independent (SFI) MOS prediction model built on a self-supervised learning (SSL) model. Its core component is an SFI convolutional layer that decouples feature extraction from the input sampling frequency. Two training strategies improve prediction performance: pretraining on a large-scale MOS dataset, and distilling knowledge from a pretrained non-SFI SSL teacher model. On AudioMOS Challenge (AMC) 2025 Track 3, the submission ranked first in one evaluation metric and fourth in the final ranking. An ablation study examines which components of the model are essential to its prediction performance. The result is a single model for speech quality assessment across multiple sampling frequencies.
📝 Abstract
We introduce our submission to the AudioMOS Challenge (AMC) 2025 Track 3: mean opinion score (MOS) prediction for speech with multiple sampling frequencies (SFs). Our model integrates an SF-independent (SFI) convolutional layer into a self-supervised learning (SSL) model to achieve SFI speech feature extraction for MOS prediction. We present two strategies to improve the MOS prediction performance of our model: distilling knowledge from a pretrained non-SFI SSL model and pretraining on a large-scale MOS dataset. Our submission to AMC 2025 Track 3 ranked first in one evaluation metric and fourth in the final ranking. We also report the results of an ablation study that investigates which factors of our model are essential.
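The abstract does not describe how the SFI convolutional layer is constructed. As a rough illustration of the general idea only, one way to make a convolution independent of the sampling frequency is to define the filter in continuous time and sample it at whatever rate the input uses. The sketch below assumes a Gabor-like kernel (a Gaussian envelope times a cosine); the kernel shape, the function names, and the parameter values are all hypothetical and are not taken from the paper.

```python
import math


def sample_kernel(center_hz, width_s, sample_rate):
    """Sample an assumed continuous-time Gabor-like kernel at `sample_rate`.

    The kernel is defined in seconds and Hz, so the same analog filter
    yields a different number of taps at each sampling frequency.
    """
    half = int(round(3 * width_s * sample_rate))  # cover +/- 3 standard deviations
    taps = []
    for n in range(-half, half + 1):
        t = n / sample_rate  # tap time in seconds
        envelope = math.exp(-0.5 * (t / width_s) ** 2)
        # Dividing by sample_rate approximates the continuous-time integral,
        # which keeps the filter gain comparable across sampling frequencies.
        taps.append(envelope * math.cos(2 * math.pi * center_hz * t) / sample_rate)
    return taps


def convolve_valid(signal, taps):
    """Plain 1-D convolution without padding ('valid' mode)."""
    k = len(taps)
    return [
        sum(signal[i + j] * taps[k - 1 - j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def peak_response(tone_hz, sample_rate, center_hz=1000.0, width_s=0.002, duration_s=0.05):
    """Peak magnitude of the filter output for a pure tone at `tone_hz`."""
    n = int(duration_s * sample_rate)
    tone = [math.sin(2 * math.pi * tone_hz * i / sample_rate) for i in range(n)]
    out = convolve_valid(tone, sample_kernel(center_hz, width_s, sample_rate))
    return max(abs(v) for v in out)


# The same analog filter applied to the same 1 kHz tone at two sampling rates.
r16 = peak_response(1000.0, 16000)
r48 = peak_response(1000.0, 48000)
```

In this sketch the kernel has three times as many taps at 48 kHz as at 16 kHz, yet the response to a 1 kHz tone stays close at both rates, because the filter is specified in physical units rather than in samples. A learned SFI layer would presumably make the analog filter parameters trainable, but that part is outside what the abstract states.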