SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction

📅 2024-07-01
🏛️ International Conference on Signal Processing and Communications
📈 Citations: 0
Influential: 0
🤖 AI Summary
In speech quality assessment, conventional objective metrics (e.g., PESQ, STOI) correlate only weakly with subjective Mean Opinion Score (MOS) ratings, while collecting MOS ratings remains costly and time-consuming. To address this, the authors propose a lightweight end-to-end MOS prediction model. The method uses speaker-agnostic, downsampled latent feature representations to support cross-speaker generalization. It applies multi-layer convolutions directly to raw waveforms for temporal feature extraction, bypassing ASR/TTS frontends and handcrafted features, and performs end-to-end regression to predict MOS on a 5-point scale. On the benchmarks evaluated, the model is reported to achieve state-of-the-art performance: MSE < 0.15, Linear Correlation Coefficient (LCC) > 0.92, Spearman Rank Correlation Coefficient (SRCC) > 0.91, and Kendall’s Tau (KTAU) > 0.78. The approach is thus presented as a favorable trade-off among accuracy, computational efficiency, and scalability.
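The pipeline described in this summary (stacked convolutions over the raw waveform, downsampled latent features, regression to a 5-point score) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption for illustration only: the function names `conv1d_strided` and `predict_mos`, the single-channel layers, the kernel size, the stride, the mean pooling, the sigmoid output mapping, and the random untrained weights. The source does not state the actual layer configuration or how speaker agnosticism is achieved, so this is not the authors' architecture.

```python
import math
import random


def conv1d_strided(signal, kernel, stride):
    """Valid 1-D cross-correlation with a stride, followed by ReLU.

    A stride greater than 1 downsamples the sequence.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        acc = sum(signal[start + i] * kernel[i] for i in range(k))
        out.append(max(0.0, acc))  # ReLU non-linearity
    return out


def predict_mos(waveform, kernels, stride, head_weight, head_bias):
    """Map a raw waveform to a score in the open interval (1, 5).

    Hypothetical head: mean-pool the latent sequence, apply one linear
    unit, then squash with a sigmoid rescaled to the MOS range.
    """
    features = waveform
    for kernel in kernels:
        features = conv1d_strided(features, kernel, stride)
    if not features:
        raise ValueError("waveform too short for this stack of convolutions")
    pooled = sum(features) / len(features)  # one value per utterance
    logit = head_weight * pooled + head_bias
    return 1.0 + 4.0 / (1.0 + math.exp(-logit))


# Illustrative run with random, untrained weights (assumed values).
random.seed(0)
waveform = [random.uniform(-1.0, 1.0) for _ in range(16000)]
kernels = [[random.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(3)]

latent = waveform
for kernel in kernels:
    latent = conv1d_strided(latent, kernel, stride=4)

score = predict_mos(waveform, kernels, stride=4, head_weight=1.0, head_bias=0.0)
print(len(waveform), "samples ->", len(latent), "latent frames; score:", score)
```

Because the weights are untrained, the printed score carries no meaning; the sketch only shows how strided convolutions shorten the sequence and how a regression head can be constrained to the MOS range.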

📝 Abstract
Speech quality assessment is a critical step in selecting text-to-speech synthesis (TTS) or voice conversion models. Synthesized speech can be evaluated using objective or subjective metrics. Although many objective metrics exist, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), none of them is adequate for selecting the best model. On the other hand, a subjective metric such as the Mean Opinion Score (MOS) is highly reliable, but it requires considerable manual effort and is time-consuming. To counter these issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), which is a small, end-to-end, highly generalized, and scalable model for predicting MOS on a 5-point scale. We stack sequences of convolutions to extract latent features from the audio samples and obtain state-of-the-art results in terms of mean squared error (MSE), Linear Concordance Correlation coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Rank Correlation Coefficient (KTAU).
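The four evaluation metrics named in the abstract compare predicted scores against human MOS ratings. The sketch below shows one standard way to compute them in plain Python. The function names and the sample scores are hypothetical, LCC is computed here as the Pearson linear correlation (an assumption about how the abbreviation is used), and Kendall's coefficient is computed in its tau-b form, which handles ties. None of the numbers come from the paper.

```python
import math


def mse(pred, true):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall tau-b, computed over all pairs in O(n^2)."""
    conc = disc = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: counts toward neither term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + only_x_tied) * (conc + disc + only_y_tied))
    return (conc - disc) / denom


# Hypothetical ratings, for illustration only.
human_mos = [4.2, 3.1, 2.5, 4.8, 3.6]
predicted = [4.0, 3.3, 2.9, 4.5, 3.2]

print("MSE :", mse(predicted, human_mos))
print("LCC :", pearson(predicted, human_mos))
print("SRCC:", spearman(predicted, human_mos))
print("KTAU:", kendall_tau(predicted, human_mos))
```

MSE measures absolute error on the rating scale, while the three correlation coefficients measure how well the predictions track the linear trend (LCC) and the ranking (SRCC, KTAU) of the human scores; ranking quality is what matters most when the goal is choosing the best of several models.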
Problem

Research questions and friction points this paper is trying to address.

Predicts MOS score for speech quality assessment
Overcomes limitations of manual MOS evaluation
Uses latent features for scalable model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker Agnostic Latent Features for MOS
End-to-end convolutional feature extraction
Generalized scalable model for MOS prediction
Saurabh Agrawal
Honeywell
R. Gohil
Samsung R&D Institute Bangalore, India
Gopal Kumar Agrawal
Samsung R&D Institute Bangalore, India
Vikram C M
Samsung R&D Institute Bangalore, India
Kushal Verma
Samsung R&D Institute Bangalore, India