🤖 AI Summary
This work addresses the significant performance degradation of current automatic speech recognition (ASR) systems on pathological speech caused by neurological disorders. The authors propose a parameter-efficient adaptation method that injects speaker-specific x-vector embeddings into each Transformer layer of a frozen SpeechLLM encoder via feature-wise linear modulation (FiLM), enabling personalized modeling without updating the base model parameters. Evaluated on English and Spanish pathological speech, the approach achieves recognition performance competitive with standard and parameter-efficient fine-tuning baselines while retaining performance on non-conditioned speech. The authors also test whether the adapted model can still answer speech-related questions.
📝 Abstract
Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech arising from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this approach on the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate whether the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
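The conditioning idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class and function names, the toy "frozen layer", the placement of FiLM after each layer, the zero initialization of the projections, and all dimensions are assumptions made for illustration. The only part taken from the abstract is the general scheme, in which a speaker embedding produces a per-feature scale and shift applied at every layer while the encoder weights stay fixed.

```python
import numpy as np


class FiLM:
    """Feature-wise linear modulation driven by a speaker embedding.

    Hypothetical sketch: names, shapes, and the identity-at-init choice
    are assumptions, not details taken from the paper.
    """

    def __init__(self, embed_dim: int, hidden_dim: int) -> None:
        # Trainable projections from the speaker embedding to a per-feature
        # scale (gamma) and shift (beta). Zero init makes FiLM a no-op
        # before any training (assumption).
        self.w_gamma = np.zeros((embed_dim, hidden_dim))
        self.w_beta = np.zeros((embed_dim, hidden_dim))

    def __call__(self, hidden: np.ndarray, speaker: np.ndarray) -> np.ndarray:
        # hidden: (frames, hidden_dim), speaker: (embed_dim,)
        gamma = 1.0 + speaker @ self.w_gamma  # (hidden_dim,)
        beta = speaker @ self.w_beta          # (hidden_dim,)
        return gamma * hidden + beta          # broadcast over frames


def frozen_layer(hidden: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Toy stand-in for one frozen encoder layer (linear map + tanh)."""
    return np.tanh(hidden @ weight)


def encode(features, speaker, layer_weights, film_layers):
    """Run every frozen layer, modulating its output with FiLM (assumed placement)."""
    hidden = features
    for weight, film in zip(layer_weights, film_layers):
        hidden = frozen_layer(hidden, weight)  # base weights are never updated
        hidden = film(hidden, speaker)         # only FiLM parameters would train
    return hidden


rng = np.random.default_rng(0)
frames, hidden_dim, embed_dim, num_layers = 5, 8, 4, 3
features = rng.normal(size=(frames, hidden_dim))
speaker = rng.normal(size=embed_dim)  # placeholder for an x-vector
layer_weights = [rng.normal(size=(hidden_dim, hidden_dim)) for _ in range(num_layers)]
film_layers = [FiLM(embed_dim, hidden_dim) for _ in range(num_layers)]

# A zero embedding gives gamma = 1 and beta = 0, i.e. the unmodified encoder.
baseline = encode(features, np.zeros(embed_dim), layer_weights, film_layers)
conditioned = encode(features, speaker, layer_weights, film_layers)
```

In this sketch a zero speaker embedding always reduces FiLM to the identity, which is one simple way an adapted model could leave non-conditioned speech untouched. Whether the paper uses this mechanism is not stated in the abstract.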