🤖 AI Summary
This study addresses speech intelligibility prediction for hearing-impaired listeners (SIP-HI) by systematically optimizing how speech foundation models (SFMs) are adapted to the task. Methodologically, a layer-wise sensitivity analysis shows that selecting a single encoder layer outperforms using the full encoder stack; a temporal-aware prediction head (LSTM/TCN) is designed and empirically shown to be critical; and a weighted ensemble of multiple SFMs is proposed, with stronger individual models contributing larger gains. An accompanying attribution analysis links intrinsic SFM properties to SIP-HI performance. Evaluated across five mainstream SFMs, the approach yields consistent accuracy improvements. Key contributions include: (i) an interpretable mapping between SFM architectural characteristics and SIP-HI performance; and (ii) a lightweight, efficient, and reusable SFM adaptation recipe for auditory intelligibility modeling.
📝 Abstract
Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study with five SFMs to identify the key design factors affecting SIP-HI performance, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to conventional use-all-layers approaches, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
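The pipeline described above (pick one encoder layer, apply a temporal prediction head, then combine several SFM-based predictors with a weighted ensemble) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function names are hypothetical, and the mean-over-time pooling stands in for the LSTM/TCN head the paper actually uses.

```python
def select_layer(hidden_states, layer_idx):
    # hidden_states: list of per-layer feature sequences from an SFM encoder;
    # the study finds a single well-chosen layer beats using all layers
    return hidden_states[layer_idx]

def temporal_pool(frames):
    # placeholder for the temporal prediction head (LSTM/TCN in the paper):
    # here, simply average frame-level scores over time
    return sum(frames) / len(frames)

def ensemble_predict(scores, weights):
    # weighted average of per-SFM intelligibility predictions;
    # stronger individual models would receive larger weights
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

# toy example: two SFM predictors, equal weights
combined = ensemble_predict([0.8, 0.6], [1.0, 1.0])
```

The weight normalization inside `ensemble_predict` means the weights only need to express relative model strength, not sum to one.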