🤖 AI Summary
This study addresses the performance degradation that high speaker heterogeneity causes in federated learning for dysarthric speech recognition. To tackle this challenge, the work applies personalized federated learning to the task, where prior research remains limited, and proposes two tailored aggregation strategies, parameter averaging and embedding averaging, that improve recognition accuracy while preserving data privacy. The approach is validated on the UASpeech and TORGO datasets against a regularized FedAvg baseline, achieving absolute word error rate reductions of up to 0.99% and 0.56%, respectively, which correspond to relative improvements of 3.15% and 4.73%.
📝 Abstract
Automatic speech recognition (ASR) is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization: a parameter-based averaging strategy and an embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the regularized FedAvg baseline, with statistically significant word error rate (WER) reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO.
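To make the idea of not forcing all speakers to share every model component concrete, the following is a minimal sketch of generic FedAvg-style weighted parameter averaging in which only a chosen subset of layers is aggregated and the rest stay local to each speaker. It is not the paper's method: the abstract does not specify how either proposed strategy works, and the regularization term of the baseline is omitted. The layer names (`encoder`, `adapter`), the `shared_keys` argument, and the sample counts are all hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # layer name -> flat list of weights


def weighted_average(
    client_params: List[Params], num_samples: List[int], shared_keys: List[str]
) -> Params:
    """FedAvg-style average of the shared layers, weighted by client data size."""
    total = sum(num_samples)
    averaged: Params = {}
    for key in shared_keys:
        size = len(client_params[0][key])
        averaged[key] = [
            sum(p[key][i] * n for p, n in zip(client_params, num_samples)) / total
            for i in range(size)
        ]
    return averaged


def personalized_round(
    client_params: List[Params], num_samples: List[int], shared_keys: List[str]
) -> List[Params]:
    """Overwrite shared layers with the average; keep all other layers local."""
    global_part = weighted_average(client_params, num_samples, shared_keys)
    return [{**params, **global_part} for params in client_params]


# Two hypothetical speakers: "encoder" is aggregated, "adapter" stays personal.
clients = [
    {"encoder": [1.0, 2.0], "adapter": [0.1]},
    {"encoder": [3.0, 6.0], "adapter": [0.9]},
]
updated = personalized_round(clients, num_samples=[1, 3], shared_keys=["encoder"])
print(updated[0])  # {'encoder': [2.5, 5.0], 'adapter': [0.1]}
```

In this sketch the second speaker contributes three times as much data, so the shared encoder is pulled toward that speaker's weights, while each speaker's adapter is left untouched by aggregation.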