🤖 AI Summary
Traditional personalized speech intelligibility prediction relies on pure-tone audiograms, which capture only hearing thresholds and thus poorly reflect actual speech understanding ability. This work abandons the audiogram-based paradigm and instead proposes the first approach that leverages users' historical speech intelligibility scores for personalization. We introduce SSIPNet, a sample-driven framework that integrates semantic representations from a pre-trained speech foundation model and employs contrastive learning combined with meta-learning to enable few-shot cross-audio prediction. With only 3–5 support samples, i.e. (audio clip, intelligibility score) pairs, SSIPNet achieves significant improvements over audiogram-based baselines on the Clarity Prediction Challenge dataset, reducing mean prediction error by 28.6%. This establishes a novel, low-resource paradigm for personalized speech intelligibility assessment.
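The summary describes a support-sample architecture: embed each of a listener's support clips with a speech foundation model, pair each embedding with its known intelligibility score, and let a new (query) clip condition on those pairs to form a listener-specific representation. The sketch below illustrates that idea in PyTorch; it is a minimal illustration, not the paper's actual SSIPNet implementation. The class and parameter names (`SupportSetPredictor`, `emb_dim`, `hidden`) are hypothetical, and the attention-based aggregation stands in for the contrastive/meta-learning machinery the summary mentions.

```python
import torch
import torch.nn as nn

class SupportSetPredictor(nn.Module):
    """Minimal sketch: predict a query clip's intelligibility score from a
    listener's support (embedding, score) pairs. Architecture is illustrative,
    not the published SSIPNet design."""
    def __init__(self, emb_dim: int = 768, hidden: int = 256):
        super().__init__()
        # Project pooled foundation-model embeddings (one vector per clip).
        self.proj = nn.Linear(emb_dim, hidden)
        # Encode each support pair: clip embedding concatenated with its score.
        self.pair_enc = nn.Sequential(nn.Linear(hidden + 1, hidden), nn.ReLU())
        # Cross-attention: the query clip attends over the encoded support pairs.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, support_emb, support_scores, query_emb):
        # support_emb: (B, K, emb_dim), support_scores: (B, K), query_emb: (B, emb_dim)
        s = self.proj(support_emb)                                          # (B, K, hidden)
        pairs = self.pair_enc(torch.cat([s, support_scores.unsqueeze(-1)], dim=-1))
        q = self.proj(query_emb).unsqueeze(1)                               # (B, 1, hidden)
        listener_repr, _ = self.attn(q, pairs, pairs)                       # (B, 1, hidden)
        fused = torch.cat([q, listener_repr], dim=-1).squeeze(1)            # (B, 2*hidden)
        return self.head(fused).squeeze(-1)                                 # score in [0, 1]

# Toy usage: K = 3 support clips per listener; random tensors stand in for
# pre-pooled speech foundation model features.
model = SupportSetPredictor()
support_emb = torch.randn(2, 3, 768)   # batch of 2 listeners
support_scores = torch.rand(2, 3)      # known intelligibility scores in [0, 1]
query_emb = torch.randn(2, 768)        # one unseen audio clip per listener
pred = model(support_emb, support_scores, query_emb)
print(pred.shape)  # torch.Size([2])
```

Under these assumptions, the attention output plays the role of the listener's "speech recognition ability" representation built from the support set; the prediction head then maps the query embedding plus that representation to a single intelligibility score.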
📝 Abstract
Personalized speech intelligibility prediction is challenging. Previous approaches have mainly relied on audiograms, which are inherently limited in accuracy because they capture only a listener's hearing thresholds for pure tones. Rather than incorporating additional listener features, we propose a novel approach that leverages an individual's existing intelligibility data to predict their performance on new audio. We introduce the Support Sample-Based Intelligibility Prediction Network (SSIPNet), a deep learning model that uses speech foundation models to build a high-dimensional representation of a listener's speech recognition ability from multiple support (audio, score) pairs, enabling accurate predictions for unseen audio. Results on the Clarity Prediction Challenge dataset show that, even with a small number of support (audio, score) pairs, our method outperforms audiogram-based predictions. Our work presents a new paradigm for personalized speech intelligibility prediction.