🤖 AI Summary
Problem: Existing speech quality assessment (SQA) methods rely almost exclusively on final-layer representations from self-supervised speech models (e.g., Wav2Vec2, HuBERT, WavLM), overlooking the potential of intermediate layers for mean opinion score (MOS) prediction.
Method: We systematically investigate the MOS-prediction efficacy of hidden-layer representations across multiple SSL models and propose an end-to-end MOS prediction framework featuring hierarchical feature extraction and a lightweight regression head.
Contribution/Results: Empirical evaluation shows that early-to-mid layer features match or outperform final-layer features in predictive accuracy, challenging the “last-layer optimality” assumption. This yields state-of-the-art performance on benchmarks including DNSMOS and VoiceMOS while reducing computational cost and model complexity. The work is presented as the first to systematically validate and exploit shallow-layer SSL representations, which balance semantic and acoustic information, for efficient and interpretable SQA, pointing toward resource-aware quality modeling.
📝 Abstract
Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean opinion score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that early-layer features outperform or match those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, which offers enhanced performance and reduced system complexity.
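The layer-wise evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: random arrays stand in for SSL hidden states, a closed-form ridge regressor stands in for the lightweight regression network (whose architecture the abstract does not specify), and all names (`fit_ridge`, `score_layers`, the planted layer index) are hypothetical. Each layer's frame-level features are mean-pooled into one utterance vector, a separate regressor is fit per layer, and layers are ranked by held-out correlation with MOS.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the bias column is regularized too, for brevity."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def score_layers(feats_train, y_train, feats_test, y_test):
    """feats_*: (num_utts, num_layers, dim). Returns held-out Pearson r per layer."""
    scores = []
    for layer in range(feats_train.shape[1]):
        w = fit_ridge(feats_train[:, layer], y_train)
        pred = predict(feats_test[:, layer], w)
        scores.append(float(np.corrcoef(pred, y_test)[0, 1]))
    return scores


# Synthetic stand-in for SSL hidden states (ASSUMPTION: not real model output).
# Shape: (utterances, layers, frames, feature dim).
rng = np.random.default_rng(0)
num_utts, num_layers, num_frames, dim = 200, 4, 20, 8
mos = rng.uniform(1.0, 5.0, size=num_utts)
hidden = rng.normal(size=(num_utts, num_layers, num_frames, dim))

# Plant a quality cue in an early layer (index 1) so the demo has a known answer.
hidden[:, 1, :, 0] += mos[:, None]

# Mean-pool over the frame axis: one vector per utterance and layer.
pooled = hidden.mean(axis=2)

# Simple train/test split, then rank layers by held-out correlation.
split = 150
scores = score_layers(pooled[:split], mos[:split], pooled[split:], mos[split:])
best_layer = int(np.argmax(scores))
print("per-layer correlation:", [round(s, 3) for s in scores])
print("best layer:", best_layer)
```

With a real SSL model, `hidden` would instead be built from the per-layer hidden states the model returns for each utterance, and the ridge regressor could be swapped for a small neural regression head without changing the per-layer comparison loop.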