WhiSQA: Non-Intrusive Speech Quality Prediction Using Whisper Encoder Features

📅 2025-08-04

🤖 AI Summary

Speech enhancement evaluation often relies on clean reference speech, which limits its practical applicability. To address this, we propose a fully non-intrusive (reference-free) speech quality prediction method. Our approach uses hidden-layer representations from the encoder of the pre-trained ASR model Whisper as input features and a lightweight neural network to predict quality. It requires no access to clean reference speech. Evaluated on all NISQA test sets, our method achieves higher correlation with subjective mean opinion scores (MOS) than recent approaches, and it adapts to new domains significantly better than the commonly used DNSMOS metric. The result is a reference-free speech quality predictor that combines high predictive accuracy with strong cross-domain robustness.
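
The summary describes the pipeline only at a high level: frame-level encoder features go in, a single quality score comes out. The sketch below illustrates that shape in plain Python. It is not the paper's architecture. The mean pooling, the single linear layer, the sigmoid scaling to the 1 to 5 MOS range, the 8-dimensional random features, and the names `mean_pool` and `predict_mos` are all assumptions made for illustration.

```python
import math
import random


def mean_pool(frames):
    """Average a (time x dim) list of frame vectors into one utterance vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def predict_mos(frames, weights, bias):
    """Map pooled features to a score in the MOS range [1, 5].

    Hypothetical head: one linear layer followed by a scaled sigmoid.
    """
    pooled = mean_pool(frames)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 + 4.0 / (1.0 + math.exp(-logit))


# Stand-in for encoder output: 50 frames of 8-dimensional features.
# Real encoder features would come from the ASR model and be much wider.
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]

weights = [0.1] * 8  # untrained, illustrative values only
bias = 0.0
score = predict_mos(frames, weights, bias)
```

In practice the weights would be fitted to human quality labels, and the random `frames` would be replaced by hidden states extracted from the encoder.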

📝 Abstract

In recent years there has been significant research effort in developing neural-network-based predictors of speech quality (SQ). While a primary objective has been to develop non-intrusive, i.e. reference-free, metrics to assess the performance of speech enhancement (SE) systems, recent work has also investigated using neural SQ predictors directly within the loss function of downstream speech tasks. To aid in the training of SQ predictors, several large datasets of audio with corresponding human quality labels have been created. Recent work in this area has shown that speech representations derived from large unsupervised or semi-supervised foundational speech models are useful input features for neural SQ prediction. In this work, a novel and robust SQ predictor is proposed based on feature representations extracted from an automatic speech recognition (ASR) model, which are found to be a powerful input feature for the SQ prediction task. The proposed system achieves higher correlation with human mean opinion score (MOS) ratings than recent approaches on all NISQA test sets and shows significantly better domain adaptation than the commonly used DNSMOS metric.
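
The abstract reports results as correlation with human MOS ratings but does not say which coefficient is used. As a minimal sketch, the code below computes two common choices, Pearson (linear) and Spearman (rank) correlation, using only the standard library. The score lists are made-up illustrative values, not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values only: human ratings and predictor outputs per clip.
human_mos = [1.2, 2.5, 3.1, 4.0, 4.6]
predicted = [1.5, 2.2, 3.4, 3.9, 4.8]

linear_corr = pearson(human_mos, predicted)
rank_corr = spearman(human_mos, predicted)
```

Pearson rewards predictions that track the ratings linearly, while Spearman only checks that clips are ordered the same way, so reporting both gives a fuller picture of predictor quality.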

Problem

Research questions and friction points this paper is trying to address.

Develop non-intrusive speech quality prediction metrics

Utilize ASR model features for robust SQ prediction

Improve correlation with human MOS ratings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Whisper encoder features for prediction

Non-intrusive speech quality assessment

Achieves higher correlation with MOS
