🤖 AI Summary
This study addresses non-intrusive automatic assessment of lyric intelligibility in the singing voice. The authors propose an end-to-end prediction framework that requires neither reference clean lyrics nor forced-alignment information. The method adapts the Whisper model, originally developed for speech recognition, to lyric intelligibility modeling, leveraging its robust phonetic and prosodic representations. A lightweight, trainable regression back-end is introduced to produce stable scores across diverse singing styles. Crucially, the approach does not rely on time-aligned ground truth, enabling plug-and-play evaluation in unsupervised settings. Evaluated on the Cadenza CLIP test set, the method reduces RMSE by 22.4% relative to the STOI baseline and achieves substantially higher normalized cross-correlation, demonstrating both effectiveness and strong generalization to unseen vocal performances and styles.
📝 Abstract
We present LIWhiz, a non-intrusive lyric intelligibility prediction system submitted to the ICASSP 2026 Cadenza Challenge. LIWhiz leverages Whisper for robust feature extraction and a trainable back-end for score prediction. Tested on the Cadenza Lyric Intelligibility Prediction (CLIP) evaluation set, LIWhiz achieves a 22.4% relative root mean squared error reduction over the STOI-based baseline, alongside a substantial improvement in normalized cross-correlation.
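To make the pipeline concrete, the sketch below illustrates the general shape of a non-intrusive predictor of this kind: frame-level encoder features are pooled over time and passed through a small trainable regression head that outputs a bounded intelligibility score. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `predict_intelligibility`, the mean-pooling choice, the sigmoid output, and the 512-dimensional random stand-in for Whisper encoder features are all illustrative.

```python
import numpy as np

def predict_intelligibility(features: np.ndarray, W: np.ndarray, b: float) -> float:
    """Map (frames, dims) encoder features to a score in [0, 1].

    Illustrative back-end: temporal mean pooling, a linear head,
    and a sigmoid squashing the output to a bounded score.
    """
    pooled = features.mean(axis=0)        # average over time frames
    logit = float(pooled @ W + b)         # linear regression head
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid -> [0, 1]

# Stand-in for encoder output: 100 frames of 512-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 512))
W = rng.standard_normal(512) * 0.01       # "learned" weights (random here)
score = predict_intelligibility(feats, W, 0.0)
print(f"predicted intelligibility: {score:.3f}")
```

Because the head is differentiable, in practice such a back-end would be trained end-to-end against listener intelligibility ratings (e.g. with an RMSE-style loss), while the pretrained encoder supplies the robust phonetic representations.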