AI Summary
This study addresses the sensitivity of Mean Opinion Score (MOS) prediction to sampling rate in cross-sampling-rate speech quality assessment. We propose a robust end-to-end model comprising three key components: (1) self-supervised speech representations (e.g., wav2vec 2.0) for extracting sampling-rate-agnostic acoustic features; (2) a selective state space model (Mamba) to enhance long-range temporal modeling; and (3) a novel continuous Gaussian radial basis function (RBF) encoding of ground-truth MOS values to mitigate regression bias induced by discrete rating scales. The method substantially reduces dependency on the input sampling rate. On the AudioMOS Challenge 2025 few-shot benchmark, our T16 system achieves a ~14% improvement over the baseline and ranks fourth in system-level Spearman's rank correlation coefficient (SRCC). Further evaluation on the BVCC dataset demonstrates superior performance, confirming strong cross-sampling-rate generalization and practical deployment potential.
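The Gaussian RBF target encoding described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the number of centers (17, one per 0.25 step on the 1-5 MOS scale) and the bandwidth `sigma` are assumptions, since the summary does not specify them.

```python
import numpy as np

# Assumed grid of RBF centers spanning the 1-5 MOS rating scale.
CENTERS = np.linspace(1.0, 5.0, 17)

def rbf_encode(mos, centers=CENTERS, sigma=0.25):
    """Encode a scalar MOS rating as a vector of Gaussian RBF activations,
    turning a discrete rating into a smooth continuous target."""
    return np.exp(-((mos - centers) ** 2) / (2.0 * sigma ** 2))

def rbf_decode(activations, centers=CENTERS):
    """Recover a continuous MOS estimate as the activation-weighted
    mean of the centers (soft argmax over the rating scale)."""
    weights = activations / activations.sum()
    return float(weights @ centers)
```

A rating of 3.5 encodes to a bump of activations centered at 3.5, and decoding the vector recovers the rating, so the model can be trained against the smooth vector target while still producing scalar MOS predictions.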
Abstract
We propose MambaRate, which predicts Mean Opinion Scores (MOS) with limited bias regarding the sampling rate of the waveform under evaluation. It is designed for Track 3 of the AudioMOS Challenge 2025, which focuses on predicting MOS for speech at high sampling frequencies. Our model leverages self-supervised embeddings and selective state space modeling. The target ratings are encoded in a continuous representation via Gaussian radial basis functions (RBF). The challenge results were based on the system-level Spearman's rank correlation coefficient (SRCC) metric. An initial MambaRate version (the T16 system) outperformed the pre-trained baseline (B03) by ~14% in a few-shot setting without pre-training. T16 ranked fourth out of five in the challenge, within ~6% of the winning system. We present additional results on the BVCC dataset, as well as ablations with different input representations that outperform the initial T16 version.
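The system-level SRCC used for ranking can be sketched as follows: average the utterance-level predicted and ground-truth MOS per system, then compute Spearman's rank correlation (i.e., the Pearson correlation of the ranks). This is a minimal sketch assuming no tied per-system means; tied values would require average ranks, as in `scipy.stats.spearmanr`.

```python
import numpy as np

def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes no ties (double argsort yields ordinal ranks)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def system_level_srcc(system_ids, predicted, true):
    """Aggregate utterance-level scores to per-system means, then correlate."""
    systems = sorted(set(system_ids))
    p = np.array([np.mean([predicted[i] for i, s in enumerate(system_ids) if s == sid])
                  for sid in systems])
    t = np.array([np.mean([true[i] for i, s in enumerate(system_ids) if s == sid])
                  for sid in systems])
    return srcc(p, t)
```

Because only the per-system ranking matters, a model can score well on SRCC even with a calibrated offset in its absolute MOS predictions.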