FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the significant performance degradation of current automatic speech recognition (ASR) systems on pathological speech caused by neurological disorders. The authors propose a parameter-efficient adaptation method that injects speaker-specific x-vector embeddings into each Transformer layer of a frozen-weight SpeechLLM encoder via feature-wise linear modulation (FiLM), enabling personalized modeling without updating the base model parameters. Evaluated on bilingual (English–Spanish) pathological speech, the approach achieves recognition performance comparable to state-of-the-art adaptation techniques while effectively preserving the model’s capability on typical speech and its generalization to spoken question-answering tasks.

📝 Abstract

Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech arising from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this approach on the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate whether the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
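To make the conditioning mechanism concrete, the following is a minimal sketch of the general FiLM operation, in which a hidden vector h is rescaled and shifted as gamma(s) * h + beta(s), with gamma and beta predicted from a speaker embedding s. It is not the authors' implementation. The class name `FiLM`, the helper `linear`, the single linear projection per modulation term, the identity initialization, and the toy dimensions are all assumptions made for illustration. Plain Python lists are used only to keep the sketch dependency-free.

```python
def linear(x, weight, bias):
    """Compute weight @ x + bias for a vector x, using plain lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class FiLM:
    """Hypothetical FiLM module: scales and shifts hidden features per speaker.

    Only these projection weights would be trained; the encoder layer that
    produces `frames` is assumed to stay frozen.
    """

    def __init__(self, speaker_dim, hidden_dim):
        # Assumed identity initialization: gamma = 1 and beta = 0, so the
        # frozen layer's output is unchanged before any adaptation.
        self.w_gamma = [[0.0] * speaker_dim for _ in range(hidden_dim)]
        self.b_gamma = [1.0] * hidden_dim
        self.w_beta = [[0.0] * speaker_dim for _ in range(hidden_dim)]
        self.b_beta = [0.0] * hidden_dim

    def __call__(self, frames, speaker_embedding):
        # One (gamma, beta) pair per utterance, shared across all time steps.
        gamma = linear(speaker_embedding, self.w_gamma, self.b_gamma)
        beta = linear(speaker_embedding, self.w_beta, self.b_beta)
        return [[g * h + b for g, h, b in zip(gamma, frame, beta)]
                for frame in frames]


# Toy data: 2 time steps with hidden size 2, and a 3-dim stand-in for an x-vector.
frames = [[1.0, 2.0], [3.0, 4.0]]
xvector = [0.5, -1.0, 2.0]

film = FiLM(speaker_dim=3, hidden_dim=2)
identity_out = film(frames, xvector)  # equals `frames` at initialization

# Hand-set weights standing in for learned ones.
film.w_gamma[0] = [1.0, 0.0, 0.0]  # gamma[0] = 1.0 + 0.5 = 1.5
film.w_beta[1] = [0.0, 0.0, 1.0]   # beta[1] = 2.0
adapted_out = film(frames, xvector)
```

In a setup like the one the abstract describes, one such module would sit in each transformer layer of the encoder. The identity initialization used here is one common way to start from the unmodified model's behavior; whether the paper uses it is not stated in this summary.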

Problem

Research questions and friction points this paper is trying to address.

pathological speech recognition

speaker conditioning

automatic speech recognition

neurological disorders

speech adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

FiLM

speaker conditioning

pathological speech recognition