WhiSQA: Non-Intrusive Speech Quality Prediction Using Whisper Encoder Features

📅 2025-08-04
🤖 AI Summary
Evaluating speech enhancement systems usually depends on clean reference speech, which limits practical use. To address this, we propose a non-intrusive (reference-free) speech quality prediction method. Our approach takes hidden-layer representations from the encoder of the pre-trained ASR model Whisper as input features and maps them to a quality score with a neural network, so it needs no access to clean reference speech. On all NISQA test sets, the method achieves higher correlation with subjective MOS ratings than recent approaches, and it adapts to new domains significantly better than the commonly used DNSMOS metric. The result is a reference-free speech quality predictor that pairs high predictive accuracy with robust cross-domain behavior.

📝 Abstract
There has been significant research effort in recent years on developing neural-network-based predictors of speech quality (SQ). While a primary objective has been to develop non-intrusive, i.e. reference-free, metrics to assess the performance of speech enhancement (SE) systems, recent work has also investigated the direct inference of neural SQ predictors within the loss function of downstream speech tasks. To aid in the training of SQ predictors, several large datasets of audio with corresponding human labels of quality have been created. Recent work in this area has shown that speech representations derived from large unsupervised or semi-supervised foundational speech models are useful input feature representations for neural SQ prediction. In this work, a novel and robust SQ predictor is proposed based on feature representations extracted from an automatic speech recognition (ASR) model, which are found to be a powerful input feature for the SQ prediction task. The proposed system achieves higher correlation with human mean opinion score (MOS) ratings than recent approaches on all NISQA test sets and shows significantly better domain adaptation than the commonly used DNSMOS metric.
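
The abstract reports results as correlation with human MOS ratings. As a minimal sketch of how such a figure is commonly computed, the Python below evaluates a linear (Pearson) and a rank (Spearman) correlation between predicted and human scores. The abstract does not say which coefficient the paper uses, so both are shown as an assumption. The function names and the five score pairs are made up for illustration and are not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-utterance scores, for illustration only.
human_mos = [1.8, 2.5, 3.1, 3.9, 4.6]
predicted_mos = [2.0, 2.4, 3.4, 3.7, 4.4]

print(f"Pearson:  {pearson(human_mos, predicted_mos):.3f}")
print(f"Spearman: {spearman(human_mos, predicted_mos):.3f}")
```

A predictor can score high on rank correlation while being biased in absolute terms, which is one reason evaluations of this kind often report more than one statistic.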
Problem

Research questions and friction points this paper is trying to address.

Develop non-intrusive speech quality prediction metrics
Utilize ASR model features for robust SQ prediction
Improve correlation with human MOS ratings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Whisper encoder features for prediction
Non-intrusive speech quality assessment
Achieves higher correlation with MOS
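
To make the data flow behind the points above concrete, here is a toy sketch of turning frame-level encoder features into a single utterance-level score. It is not the paper's architecture, which this page does not describe. The mean pooling, the single linear layer, the 1 to 5 output range, and every name and size in the code are assumptions chosen for illustration, and random numbers stand in for real Whisper encoder features.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level feature vectors over time into one vector."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


def predict_mos(frames, weights, bias):
    """Hypothetical head: mean pool, linear layer, squash into (1, 5)."""
    pooled = mean_pool(frames)
    logit = sum(w * v for w, v in zip(weights, pooled)) + bias
    # A scaled sigmoid keeps the output inside an assumed 1-5 MOS range.
    return 1.0 + 4.0 / (1.0 + math.exp(-logit))


random.seed(0)
num_frames, feature_dim = 50, 8  # toy sizes, far smaller than real encoder output

# Random stand-in for a (num_frames x feature_dim) block of encoder features.
frames = [[random.gauss(0.0, 1.0) for _ in range(feature_dim)]
          for _ in range(num_frames)]
# Untrained, randomly initialised weights; a real head would be learned.
weights = [random.gauss(0.0, 0.1) for _ in range(feature_dim)]

score = predict_mos(frames, weights, bias=0.0)
print(f"predicted score: {score:.2f}")
```

In practice the weights would be fitted to human quality labels, and the pooling step could be replaced by any sequence model. The sketch only shows where a reference-free predictor gets its input: from the degraded signal alone.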