Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech

📅 2025-08-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing speech quality assessment (SQA) methods rely too heavily on final-layer representations from self-supervised learning (SSL) speech models (e.g., Wav2Vec2, HuBERT, WavLM), overlooking the potential of intermediate layers for mean opinion score (MOS) prediction. Method: We systematically investigate the MOS-prediction efficacy of hidden-layer representations across multiple SSL models and propose an end-to-end MOS prediction framework featuring hierarchical feature extraction and a lightweight regression head. Contribution/Results: Empirical evaluation shows that early-to-mid layer features match or significantly outperform the predictive accuracy of final-layer features, challenging the "last-layer optimality" assumption. This yields state-of-the-art performance on benchmarks including DNSMOS and VoiceMOS while reducing computational cost and model complexity. Our work is the first to systematically validate and exploit early-layer SSL representations, which balance semantic and acoustic information, for efficient and interpretable SQA, supporting resource-aware quality modeling.

📝 Abstract

Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean-opinion-score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that early-layer features match or outperform those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, which offers enhanced performance and reduced system complexity.
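
The layer-wise evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: synthetic frame-level features stand in for SSL hidden states, a linear head trained by gradient descent stands in for the lightweight regression network, and held-out Pearson correlation is assumed as the comparison metric. All function names (`mean_pool`, `fit_linear_head`, `score_layers`, `make_toy_data`) are hypothetical.

```python
import random

def mean_pool(frames):
    """Average frame-level features (T x D) over time into one D-dim vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def fit_linear_head(X, y, lr=0.01, epochs=500):
    """Fit y ~ w.x + b by batch gradient descent on mean squared error.
    Assumption: a linear head stands in for the lightweight regression network."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - target
            grad_b += err
            for d in range(dim):
                grad_w[d] += err * x[d]
        b -= lr * 2 * grad_b / n
        w = [wi - lr * 2 * g / n for wi, g in zip(w, grad_w)]
    return w, b

def predict(head, x):
    w, b = head
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

def score_layers(layer_feats, mos, n_train):
    """Train one head per layer on the first n_train utterances and
    return the held-out correlation with MOS for each layer."""
    scores = []
    for feats in layer_feats:
        pooled = [mean_pool(utt) for utt in feats]
        head = fit_linear_head(pooled[:n_train], mos[:n_train])
        preds = [predict(head, x) for x in pooled[n_train:]]
        scores.append(pearson(preds, mos[n_train:]))
    return scores

def make_toy_data(n_utts=60, n_layers=4, informative_layer=1, seed=0):
    """Synthetic stand-in for SSL hidden states: only one (early) layer
    carries a MOS-related signal; all other layers are pure noise."""
    rng = random.Random(seed)
    mos = [rng.uniform(1.0, 5.0) for _ in range(n_utts)]
    layer_feats = []
    for layer in range(n_layers):
        utts = []
        for m in mos:
            frames = []
            for _ in range(5):  # 5 frames of 3-dim features per utterance
                f = [rng.gauss(0.0, 1.0) for _ in range(3)]
                if layer == informative_layer:
                    f[0] = m + rng.gauss(0.0, 0.3)
                frames.append(f)
            utts.append(frames)
        layer_feats.append(utts)
    return layer_feats, mos

layer_feats, mos = make_toy_data()
scores = score_layers(layer_feats, mos, n_train=40)
best_layer = max(range(len(scores)), key=scores.__getitem__)
print("best layer:", best_layer, "scores:", [round(s, 3) for s in scores])
```

With a real SSL model, `layer_feats` would instead hold the hidden states produced by each transformer layer for every utterance; the selection loop itself stays the same. Scoring on held-out utterances rather than the training set keeps noisy layers from looking useful through overfitting.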

Problem

Research questions and friction points this paper is trying to address.

Evaluating SSL model layers for MOS prediction

Assessing early-layer versus last-layer features in SQA

Improving MOS prediction performance with layer selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes early-layer features from SSL models

Employs a lightweight regression network for evaluation

Outperforms conventional last-layer feature approaches
