JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
In speech quality assessment (SQA), subjective Mean Opinion Score (MOS) ratings suffer from high variance due to perceptual individuality and experimental-design limitations, and existing models inadequately capture human auditory perception. To address this, we propose JSQA, a two-stage framework: first, audio pairs are constructed according to the just-noticeable difference (JND) principle, enabling perception-inspired contrastive pretraining that explicitly builds psychoacoustic structure into the audio encoder's learned representations; second, the pretrained model is fine-tuned on the NISQA dataset for MOS regression. JSQA is the first SQA method to systematically integrate JND into representation learning, removing the need for large-scale manually annotated pretraining data. Experiments show that JSQA significantly outperforms a from-scratch baseline across multiple metrics in prediction accuracy and robustness, with particularly strong cross-dataset generalization.

📝 Abstract
Speech quality assessment (SQA) is often used to learn a mapping from a high-dimensional input space to a scalar that represents the mean opinion score (MOS) of the perceptual speech quality. Learning such a mapping is challenging for many reasons, but largely because MOS exhibits high levels of inherent variance due to perceptual and experimental-design differences. Many solutions have been proposed, but most do not properly incorporate perceptual factors into their learning algorithms (beyond the MOS label), which can lead to unsatisfactory results. To this end, we propose JSQA, a two-stage framework that pretrains an audio encoder using perceptually-guided contrastive learning on just noticeable difference (JND) pairs, followed by fine-tuning for MOS prediction. We first generate pairs of audio data within JND levels, which are then used to pretrain an encoder to leverage perceptual quality similarity information and map it into an embedding space. The JND pairs come from clean LibriSpeech utterances that are mixed with background noise from CHiME-3, at different signal-to-noise ratios (SNRs). The encoder is later fine-tuned with audio samples from the NISQA dataset for MOS prediction. Experimental results suggest that perceptually-inspired contrastive pretraining significantly improves the model performance evaluated by various metrics when compared against the same network trained from scratch without pretraining. These findings suggest that incorporating perceptual factors into pretraining greatly contributes to the improvement in performance for SQA.
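The pair-construction step described in the abstract (mixing clean LibriSpeech utterances with CHiME-3 noise at different SNRs) can be sketched as follows. This is a minimal illustration, not the paper's released pipeline: the helper names `mix_at_snr`/`make_jnd_pair` and the 1 dB JND step are assumptions.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a clean utterance with background noise at a target SNR (dB)."""
    # Tile or trim the noise to match the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so that 10 * log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

def make_jnd_pair(clean, noise, snr_db, jnd_db=1.0):
    """Two noisy versions of the same utterance whose SNRs differ by an
    assumed just-noticeable amount (jnd_db), forming one positive pair."""
    return mix_at_snr(clean, noise, snr_db), mix_at_snr(clean, noise, snr_db + jnd_db)
```

Pairs generated this way carry a "perceptually similar" label for free, which is what lets the contrastive pretraining stage avoid large-scale human annotation.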
Problem

Research questions and friction points this paper is trying to address.

Improves speech quality assessment via perceptually-inspired contrastive pretraining
Addresses high variance in mean opinion score (MOS) ratings
Leverages JND audio pairs to learn perceptual-quality embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive pretraining with JND audio pairs
Perceptually-guided encoder fine-tuning
SNR-based LibriSpeech and CHiME-3 mixing
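The contrastive pretraining over JND pairs can be sketched with a standard NT-Xent objective, where each utterance's JND counterpart is the positive and all other batch items are negatives. The paper's exact loss and temperature are not stated here, so this NumPy version is an assumed, illustrative variant.

```python
import numpy as np

def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, tau: float = 0.1) -> float:
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z1[i] and z2[i] are embeddings of a JND pair (positives); every other
    item in the concatenated batch serves as a negative.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # (2N, D)
    sim = (z @ z.T) / tau                          # temperature-scaled cosine sims
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity

    n = len(z1)
    targets = np.concatenate([np.arange(n) + n, np.arange(n)])  # i <-> i+N
    # Numerically stable cross-entropy: -log softmax at the positive index.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - (row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True)))
    return float(-log_prob[np.arange(2 * n), targets].mean())
```

Minimizing this loss pulls JND-pair embeddings together while pushing apart clips at perceptually distinguishable quality levels, which is the structure the MOS fine-tuning stage then exploits.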
Junyi Fan, University of Southern California
Donald Williamson, The Ohio State University, USA