🤖 AI Summary
In speech quality assessment, conventional objective metrics (e.g., PESQ, STOI) exhibit weak correlation with subjective Mean Opinion Score (MOS) ratings, while MOS collection remains costly and time-consuming. To address this, we propose a lightweight end-to-end MOS prediction model. Our method introduces speaker-agnostic downsampled latent feature representations to ensure strong cross-speaker generalization; applies multi-layer convolution directly to raw waveforms for temporal feature extraction, bypassing ASR/TTS frontends and handcrafted features; and performs end-to-end regression to predict 5-point MOS scores. Evaluated on mainstream benchmarks, the model achieves state-of-the-art performance: MSE < 0.15, Linear Correlation Coefficient (LCC) > 0.92, Spearman Rank Correlation Coefficient (SRCC) > 0.91, and Kendall’s Tau (KTAU) > 0.78. The approach thus delivers a favorable trade-off among accuracy, computational efficiency, and scalability.
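The data flow described above (stacked convolutions on the raw waveform, a progressively downsampled latent representation, and a regression to a 5-point score) can be illustrated with a minimal pure-Python sketch. This is not the SALF-MOS architecture: the single-channel layers, kernel size, stride, average pooling, sigmoid scaling, and the names `conv1d_strided` and `predict_mos` are all assumptions chosen for illustration, and the weights are random and untrained.

```python
import math
import random


def conv1d_strided(signal, kernel, stride):
    """Valid 1-D convolution (cross-correlation) with a stride, followed by ReLU."""
    k = len(kernel)
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        acc = sum(signal[start + j] * kernel[j] for j in range(k))
        out.append(max(0.0, acc))  # ReLU non-linearity
    return out


def predict_mos(waveform, kernels, stride, head_weight, head_bias):
    """Stack strided convolutions, pool over time, and regress to the 1-5 range."""
    features = waveform
    for kernel in kernels:
        # Each strided layer shortens the sequence, i.e. downsamples it.
        features = conv1d_strided(features, kernel, stride)
    pooled = sum(features) / len(features)  # global average pooling over time
    logit = head_weight * pooled + head_bias
    # Assumed output scaling: a sigmoid squashed into the interval [1, 5].
    return 1.0 + 4.0 / (1.0 + math.exp(-logit))


# Hypothetical, untrained weights: three layers, kernel size 8, stride 4.
rng = random.Random(0)
kernels = [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(3)]

# One second of a synthetic 440 Hz tone at 16 kHz stands in for real speech.
waveform = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]

score = predict_mos(waveform, kernels, stride=4, head_weight=1.0, head_bias=0.0)
print(f"predicted MOS (untrained): {score:.2f}")
```

With these assumed settings, three stride-4 layers reduce 16,000 input samples to 248 latent frames before pooling. A trained model would learn the kernels and the regression head by minimizing the error against human MOS labels.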
📝 Abstract
Speech quality assessment is a critical step in selecting text-to-speech (TTS) synthesis or voice conversion models. Synthesized speech can be evaluated with objective or subjective metrics. Many objective metrics exist, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), but none of them is feasible for selecting the best model. A subjective metric such as the Mean Opinion Score (MOS), on the other hand, is highly reliable but requires considerable manual effort and is time-consuming. To address these issues with MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), which is a small, end-to-end, highly generalized, and scalable model for predicting MOS on a 5-point scale. We stack sequences of convolutions to extract latent features from the audio samples and achieve state-of-the-art results in terms of mean squared error (MSE), Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Rank Correlation Coefficient (KTAU).
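The four evaluation metrics named above compare predicted scores against human MOS labels. The sketch below computes them in plain Python, assuming that LCC denotes the Pearson correlation and KTAU the tie-corrected Kendall tau-b. The function names and the five sample scores are hypothetical and are not results from the paper.

```python
import math


def mse(pred, true):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pearson(x, y):
    """Linear (Pearson) correlation coefficient, used here for LCC."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """KTAU: Kendall rank correlation with the tau-b correction for ties."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) / 2
    return (concordant - discordant) / math.sqrt((pairs - tied_x) * (pairs - tied_y))


# Hypothetical scores for five utterances (not data from the paper).
true_mos = [4.5, 3.0, 2.0, 4.0, 1.5]
pred_mos = [4.2, 3.3, 2.4, 4.4, 1.8]

print("MSE :", mse(pred_mos, true_mos))
print("LCC :", pearson(pred_mos, true_mos))
print("SRCC:", spearman(pred_mos, true_mos))
print("KTAU:", kendall_tau_b(pred_mos, true_mos))
```

In the sample data, the first and fourth utterances are ranked in the opposite order by the predictions, so both rank correlations fall below 1 even though every prediction is within half a point of its label. Lower MSE and higher correlations indicate closer agreement with human listeners.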