Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People

📅 2025-05-13
🤖 AI Summary
This study addresses speech intelligibility prediction for hearing-impaired listeners (SIP-HI) and systematically examines how best to adapt speech foundation models (SFMs) to the task. A layer-wise sensitivity analysis shows that features from a single encoder layer outperform using all encoder layers; a temporal-aware prediction head (LSTM/TCN) is designed and empirically shown to be critical; and a weighted ensemble of multiple SFMs is proposed, accompanied by an attribution analysis of the intrinsic model properties that govern SIP-HI performance. Evaluated across five mainstream SFMs, the approach achieves significant accuracy gains and establishes new state-of-the-art results on multiple hearing-loss simulation datasets. Key contributions include: (i) the first interpretable mapping between SFM architectural characteristics and SIP-HI performance; and (ii) a lightweight, efficient, and reusable SFM adaptation paradigm tailored for auditory intelligibility modeling.

📝 Abstract
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to traditional use-all-layers methods, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
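
The contrast between single-layer selection and a use-all-layers baseline, followed by a prediction head that models time, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the hidden states are random stand-ins for an SFM encoder's output, the layer index is arbitrary, the function names are hypothetical, and the untrained temporal-convolution head only shows where temporal modeling enters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output: one hidden-state matrix per layer,
# shape (num_layers, num_frames, feature_dim). Random stand-in values.
num_layers, num_frames, feature_dim = 12, 50, 16
hidden_states = rng.standard_normal((num_layers, num_frames, feature_dim))


def select_single_layer(states, layer_index):
    """Use the output of one encoder layer as the feature sequence."""
    return states[layer_index]


def weighted_all_layers(states, layer_logits):
    """Use-all-layers baseline: softmax-weighted sum over every layer."""
    weights = np.exp(layer_logits - layer_logits.max())
    weights /= weights.sum()
    # Contract the layer axis: (L,) x (L, T, D) -> (T, D).
    return np.tensordot(weights, states, axes=1)


def temporal_conv_head(features, kernel, out_weights):
    """Toy temporal head: mix neighbouring frames, then pool over time."""
    frames, _ = features.shape
    width = kernel.shape[0]  # assumed odd so the padding is symmetric
    pad = width // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))
    # Each output frame is a weighted sum of `width` neighbouring frames.
    mixed = np.stack([
        np.tensordot(kernel, padded[t:t + width], axes=1)
        for t in range(frames)
    ])
    frame_scores = np.tanh(mixed) @ out_weights  # shape (T,)
    pooled = frame_scores.mean()
    return 1.0 / (1.0 + np.exp(-pooled))  # squash to the range (0, 1)


# Arbitrary layer index and untrained weights, for illustration only.
features = select_single_layer(hidden_states, layer_index=8)
kernel = np.array([0.25, 0.5, 0.25])
out_weights = rng.standard_normal(feature_dim) / np.sqrt(feature_dim)
score = temporal_conv_head(features, kernel, out_weights)
```

In practice the layer index would be chosen by validation performance and the head's parameters would be trained; the sketch fixes both only to keep the example self-contained.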

Problem

Research questions and friction points this paper is trying to address.

Optimizing speech foundation models for hearing-impaired intelligibility prediction
Identifying key design factors in encoder layers and prediction heads
Exploring ensemble methods to improve model performance for SIP-HI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Select single encoder layer for better performance
Use temporal modeling in prediction heads
Ensemble multiple SFMs to enhance results
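
The ensembling idea in the list above can be sketched as a weighted average of per-model intelligibility scores. This is a minimal sketch under stated assumptions: the function name, the example scores, and the weights are hypothetical, scores are assumed to lie in [0, 1], and how the weights are actually chosen is not specified here.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model intelligibility scores using normalised weights."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Hypothetical scores from three SFM-based predictors for one utterance.
model_scores = [0.62, 0.70, 0.58]
# Hypothetical weights, e.g. favouring the model that validated best.
model_weights = [1.0, 2.0, 1.0]

ensemble_score = weighted_ensemble(model_scores, model_weights)
```

Giving a larger weight to a stronger individual model pulls the combined score toward that model's prediction, while equal weights reduce to a plain average.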

Authors

Haoshuai Zhou
Orka Labs Inc., China

Boxuan Cao
Orka Lab Inc.
Biomedical, Deep Learning, Artificial Intelligence

Changgeng Mo
Orka Labs Inc., China

Linkai Li
Head of Engineering, Orka Inc
Signal Processing, Speech Enhancement, Biomedical Optics

Shan Xiang Wang
Materials Science and Engineering, Stanford University, United States