🤖 AI Summary
This study addresses speech intelligibility prediction for hearing-impaired listeners (SIP-HI) by systematically optimizing how speech foundation models (SFMs) are adapted to the task. Methodologically, a layer-wise sensitivity analysis shows that selecting a single encoder layer outperforms using the full encoder stack; a temporal-aware prediction head (LSTM/TCN) is designed and empirically shown to be critical; and a weighted ensemble of multiple SFMs is proposed, with stronger individual models contributing larger gains. An accompanying attribution analysis links intrinsic SFM properties to SIP-HI performance. Evaluated across five mainstream SFMs, the approach yields consistent accuracy improvements. Key contributions include: (i) an interpretable mapping between SFM architectural characteristics and SIP-HI performance; and (ii) a lightweight, efficient, and reusable SFM adaptation recipe for auditory intelligibility modeling.
📝 Abstract
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study with five SFMs to identify the key design factors affecting SIP-HI performance, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to conventional use-all-layers approaches, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
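The pipeline described above (pick one encoder layer, apply a temporal prediction head, then combine several SFM-based predictors with a weighted ensemble) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names are hypothetical, and the mean-over-time pooling stands in for the LSTM/TCN head the paper actually uses.

```python
def select_layer(hidden_states, layer_idx):
    # hidden_states: list of per-layer feature sequences from an SFM encoder;
    # the study finds a single well-chosen layer beats using all layers
    return hidden_states[layer_idx]

def temporal_pool(frames):
    # placeholder for the temporal prediction head (LSTM/TCN in the paper):
    # here, simply average frame-level scores over time
    return sum(frames) / len(frames)

def ensemble_predict(scores, weights):
    # weighted average of per-SFM intelligibility predictions;
    # stronger individual models would receive larger weights
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

# toy example: two SFM predictors, equal weights
combined = ensemble_predict([0.8, 0.6], [1.0, 1.0])
```

The weight normalization inside `ensemble_predict` means the weights only need to express relative model strength, not sum to one.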