AI Summary
This study addresses the sensitivity of Mean Opinion Score (MOS) prediction to sampling rate in cross-sampling-rate speech quality assessment. We propose a robust end-to-end model comprising three key components: (1) self-supervised speech representations (e.g., wav2vec 2.0) for extracting sampling-rate-agnostic acoustic features; (2) a selective state space model (Mamba) to enhance long-range temporal modeling; and (3) a novel continuous Gaussian radial basis function (RBF) encoding of ground-truth MOS values to mitigate regression bias induced by discrete rating scales. The method substantially reduces dependency on the input sampling rate. On the AudioMOS Challenge 2025 few-shot benchmark, our T16 system achieves a ~14% improvement over the baseline and ranks fourth in system-level Spearman's rank correlation coefficient (SRCC). Further evaluation on the BVCC dataset demonstrates superior performance, confirming strong cross-sampling-rate generalization and practical deployment potential.
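The Gaussian RBF target encoding described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the number of centers (17, one per 0.25 step on the 1-5 MOS scale) and the bandwidth `sigma` are assumptions, since the summary does not specify them.

```python
import numpy as np

# Assumed grid of RBF centers spanning the 1-5 MOS rating scale.
CENTERS = np.linspace(1.0, 5.0, 17)

def rbf_encode(mos, centers=CENTERS, sigma=0.25):
    """Encode a scalar MOS rating as a vector of Gaussian RBF activations,
    turning a discrete rating into a smooth continuous target."""
    return np.exp(-((mos - centers) ** 2) / (2.0 * sigma ** 2))

def rbf_decode(activations, centers=CENTERS):
    """Recover a continuous MOS estimate as the activation-weighted
    mean of the centers (soft argmax over the rating scale)."""
    weights = activations / activations.sum()
    return float(weights @ centers)
```

A rating of 3.5 encodes to a bump of activations centered at 3.5, and decoding the vector recovers the rating, so the model can be trained against the smooth vector target while still producing scalar MOS predictions.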
Abstract
We propose MambaRate, which predicts Mean Opinion Scores (MOS) with limited bias regarding the sampling rate of the waveform under evaluation. It is designed for Track 3 of the AudioMOS Challenge 2025, which focuses on predicting MOS for speech at high sampling frequencies. Our model leverages self-supervised embeddings and selective state space modeling. The target ratings are encoded in a continuous representation via Gaussian radial basis functions (RBF). The challenge results were based on the system-level Spearman's rank correlation coefficient (SRCC) metric. An initial MambaRate version (the T16 system) outperformed the pre-trained baseline (B03) by ~14% in a few-shot setting without pre-training. T16 ranked fourth out of five in the challenge, within ~6% of the winning system. We present additional results on the BVCC dataset, as well as ablations with different input representations that outperform the initial T16 version.
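The system-level SRCC used for ranking can be sketched as follows: average the utterance-level predicted and ground-truth MOS per system, then compute Spearman's rank correlation (i.e., the Pearson correlation of the ranks). This is a minimal sketch assuming no tied per-system means; tied values would require average ranks, as in `scipy.stats.spearmanr`.

```python
import numpy as np

def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes no ties (double argsort yields ordinal ranks)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def system_level_srcc(system_ids, predicted, true):
    """Aggregate utterance-level scores to per-system means, then correlate."""
    systems = sorted(set(system_ids))
    p = np.array([np.mean([predicted[i] for i, s in enumerate(system_ids) if s == sid])
                  for sid in systems])
    t = np.array([np.mean([true[i] for i, s in enumerate(system_ids) if s == sid])
                  for sid in systems])
    return srcc(p, t)
```

Because only the per-system ranking matters, a model can score well on SRCC even with a calibrated offset in its absolute MOS predictions.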