SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction

📅 2024-07-01

🏛️ International Conference on Signal Processing and Communications

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Conventional objective speech quality metrics (e.g., PESQ, STOI) correlate weakly with subjective Mean Opinion Score (MOS) ratings, while collecting MOS ratings remains costly and time-consuming. To address this, we propose SALF-MOS, a lightweight end-to-end MOS prediction model. The method derives speaker-agnostic, downsampled latent features to support cross-speaker generalization; applies stacked convolutions directly to raw waveforms for temporal feature extraction, bypassing ASR/TTS frontends and handcrafted features; and regresses a MOS score on a 5-point scale. Evaluated on mainstream benchmarks, the model achieves state-of-the-art performance: MSE < 0.15, Linear Correlation Coefficient (LCC) > 0.92, Spearman Rank Correlation Coefficient (SRCC) > 0.91, and Kendall's Tau (KTAU) > 0.78. The approach thus offers a favorable trade-off among accuracy, computational efficiency, and scalability.
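The pipeline described in this summary (stacked convolutions over the raw waveform, a downsampled latent representation, then regression to a score) can be illustrated with a small standard-library sketch. Everything specific here is an assumption made for illustration, not the paper's architecture: the layer counts, kernel sizes, strides, ReLU activation, average pooling, the linear head, the 1 to 5 clamp, and the helper names `conv1d_strided`, `encode`, and `regress_mos`. The weights are random and untrained, so the output is not a meaningful quality estimate.

```python
import random

def conv1d_strided(x, kernels, stride):
    """Strided 1-D convolution followed by ReLU.
    x: list of input channels (each a list of floats).
    kernels: [out_channels][in_channels][kernel_size] weights."""
    k = len(kernels[0][0])
    n_out = (len(x[0]) - k) // stride + 1  # stride > 1 downsamples in time
    out = []
    for filt in kernels:
        channel = []
        for t in range(n_out):
            start = t * stride
            acc = 0.0
            for c, taps in enumerate(filt):
                for j, w in enumerate(taps):
                    acc += w * x[c][start + j]
            channel.append(max(acc, 0.0))  # ReLU (assumed activation)
        out.append(channel)
    return out

def init_kernels(rng, out_ch, in_ch, k):
    """Random, untrained weights; a real model would learn these."""
    scale = (in_ch * k) ** -0.5
    return [[[rng.uniform(-scale, scale) for _ in range(k)]
             for _ in range(in_ch)] for _ in range(out_ch)]

def encode(waveform, layers):
    """Map a raw waveform to downsampled latent features."""
    feats = [waveform]  # a single input channel
    for kernels, stride in layers:
        feats = conv1d_strided(feats, kernels, stride)
    return feats

def regress_mos(latent, head_w, head_b):
    """Average-pool each latent channel over time, apply a linear head,
    and clamp to the 1..5 range (the range is an assumption)."""
    pooled = [sum(ch) / len(ch) for ch in latent]
    score = sum(w * p for w, p in zip(head_w, pooled)) + head_b
    return min(5.0, max(1.0, score))

rng = random.Random(0)
# Hypothetical layer sizes: (kernels, stride) per layer.
layers = [
    (init_kernels(rng, 4, 1, 8), 4),
    (init_kernels(rng, 8, 4, 4), 2),
    (init_kernels(rng, 8, 8, 4), 2),
]
head_w = [rng.uniform(-0.1, 0.1) for _ in range(8)]
head_b = 3.0

waveform = [rng.uniform(-1.0, 1.0) for _ in range(1600)]  # stand-in audio
latent = encode(waveform, layers)
mos = regress_mos(latent, head_w, head_b)
print(len(latent), len(latent[0]), mos)
```

With these assumed strides, 1600 input samples shrink to 98 latent frames across 8 channels, which is the sense in which the latent features are downsampled. Nothing in this sketch enforces speaker agnosticism; that property would have to come from how the real model is designed and trained.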

📝 Abstract

Speech quality assessment is a critical step in selecting text-to-speech synthesis (TTS) or voice conversion models. Synthesized voice can be evaluated using objective or subjective metrics. Although there are many objective metrics, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), none of them is suitable for selecting the best model. Subjective metrics such as the Mean Opinion Score (MOS), on the other hand, are highly reliable but require substantial manual effort and are time-consuming to collect. To counter these issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), which is a small, end-to-end, highly generalized, and scalable model for predicting MOS on a 5-point scale. We stack sequences of convolutions to obtain latent features of the audio samples and achieve state-of-the-art results in terms of mean squared error (MSE), Linear Concordance Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Rank Correlation Coefficient (KTAU).
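The four evaluation metrics named in the abstract can be computed from paired lists of human MOS ratings and model predictions. The sketch below uses only the Python standard library. It assumes that LCC refers to the Pearson linear correlation, as is common in MOS prediction work, and that KTAU is Kendall's tau-b; the function names and the sample scores are made up for illustration and are not results from the paper.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between ratings and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(x, y):
    """Pearson linear correlation (assumed meaning of LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """KTAU as tau-b, counting concordant and discordant pairs (O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = math.sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom

# Made-up scores for illustration only.
human_mos = [4.5, 3.0, 2.0, 4.0, 1.5]
predicted = [4.2, 3.1, 2.5, 4.4, 1.9]

results = {
    "MSE": mse(human_mos, predicted),
    "LCC": pearson(human_mos, predicted),
    "SRCC": spearman(human_mos, predicted),
    "KTAU": kendall_tau_b(human_mos, predicted),
}
print(results)
```

In this toy data the predictions swap the order of the two best utterances, so both rank correlations fall below 1 even though every prediction is within 0.5 of the human rating. This is why rank-based metrics are reported alongside MSE when the goal is to pick the best model.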

Problem

Research questions and friction points this paper is trying to address.

Predicts MOS score for speech quality assessment

Overcomes limitations of manual MOS evaluation

Uses latent features for scalable model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker Agnostic Latent Features for MOS

End-to-end convolutional feature extraction

Generalized scalable model for MOS prediction
