🤖 AI Summary
To address the low accuracy and poor generalizability of dysarthria recognition in amyotrophic lateral sclerosis (ALS) patients, this paper proposes a hypernetwork-driven speech analysis framework, which the authors describe as the first application of hypernetworks to this task. The framework takes log-Mel spectrograms together with their first- and second-order derivatives (Δ/ΔΔ) as input, passes them through a pretrained, modified AlexNet, and uses a hypernetwork to generate the weights of a target network, so that the parameters adapt to each input. The authors report gains in parameter efficiency, generalization, and robustness. Evaluated on the public VOC-ALS dataset, the framework reaches up to 82.66% classification accuracy, outperforming strong baselines that include multimodal fusion methods. An ablation study supports the effectiveness of the proposed methodology.
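The three-channel input described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a log-Mel spectrogram is already available as a NumPy array of shape `(n_mels, frames)`, and it uses the common regression-based delta formula with a hypothetical window `width=2`, since the summary does not state how the derivatives are computed. The names `delta` and `stack_channels` are illustrative.

```python
import numpy as np

def delta(feat, width=2):
    """Regression-based delta along the time axis (assumed formula).

    d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2), with edge padding.
    """
    n_frames = feat.shape[1]
    padded = np.pad(feat, ((0, 0), (width, width)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[:, width + n : width + n + n_frames]   # c[t+n]
        behind = padded[:, width - n : width - n + n_frames]  # c[t-n]
        out += n * (ahead - behind)
    return out / denom

def stack_channels(log_mel):
    """Stack log-Mel, delta, and delta-delta into a (3, n_mels, frames) image."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return np.stack([log_mel, d1, d2], axis=0)

# Toy input: a spectrogram that rises linearly over time (not real speech).
ramp = np.tile(np.arange(10, dtype=float), (4, 1))
image = stack_channels(ramp)
print(image.shape)  # (3, 4, 10)
```

On this toy ramp the delta channel equals 1 away from the edges and the delta-delta channel equals 0 in the centre, which is a quick sanity check of the formula. The resulting array has the same channel layout as an RGB image, which is what lets an image backbone such as AlexNet consume it.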
📝 Abstract
Amyotrophic Lateral Sclerosis (ALS) is a progressive neurodegenerative disease with varying symptoms, including a decline in speech intelligibility. Existing studies, which recognize dysarthria in ALS patients by predicting the clinical standard ALSFRS-R, rely on feature extraction strategies and on customized convolutional neural networks followed by dense layers. However, recent studies have shown that neural networks built on input-conditional computation offer several benefits, including faster training, better performance, and flexibility, and such networks have not yet been applied to this task. To address this gap, we present the first study incorporating hypernetworks for recognizing dysarthria. Specifically, we convert each audio file into a log-Mel spectrogram with its delta and delta-delta features and pass the resulting image through a pretrained, modified AlexNet model. Finally, we use a hypernetwork, which generates the weights of a target network. Experiments are conducted on a newly collected, publicly available dataset, namely VOC-ALS. Results show that the proposed approach reaches an accuracy of up to 82.66%, outperforming strong baselines, including multimodal fusion methods, while findings from an ablation study demonstrate the effectiveness of the introduced methodology. Overall, our hypernetwork-based approach offers advantages over state-of-the-art methods in terms of generalization ability, parameter efficiency, and robustness.
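To make the hypernetwork idea concrete, here is a minimal NumPy sketch of one network generating the weights of another. It is an illustration under stated assumptions, not the paper's architecture: the abstract does not specify the target network, the conditioning input, or any sizes, so the sketch assumes a linear classifier as the target, conditions on the backbone feature vector itself, and uses hypothetical dimensions (`feat_dim=8`, `n_classes=3`, `hidden=16`). The class name `HyperLinear` is illustrative, and the weights are random rather than trained.

```python
import numpy as np

class HyperLinear:
    """Small MLP hypernetwork that emits the weights of a linear target layer."""

    def __init__(self, feat_dim, n_classes, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.feat_dim, self.n_classes = feat_dim, n_classes
        n_target = n_classes * feat_dim + n_classes  # target weights + biases
        # Only these hypernetwork parameters would be learned.
        self.W1 = rng.normal(0.0, 0.1, (hidden, feat_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_target, hidden))
        self.b2 = np.zeros(n_target)

    def generate(self, z):
        """Map a feature vector z to the target layer's (W, b)."""
        h = np.tanh(self.W1 @ z + self.b1)
        theta = self.W2 @ h + self.b2
        split = self.n_classes * self.feat_dim
        W = theta[:split].reshape(self.n_classes, self.feat_dim)
        b = theta[split:]
        return W, b

    def forward(self, z):
        """Apply the generated target layer to z and return class probabilities."""
        W, b = self.generate(z)
        logits = W @ z + b
        exp = np.exp(logits - logits.max())  # numerically stable softmax
        return exp / exp.sum()

# Stand-ins for backbone features of two different recordings (random, not real data).
rng = np.random.default_rng(1)
z_a, z_b = rng.normal(size=8), rng.normal(size=8)
model = HyperLinear(feat_dim=8, n_classes=3)
probs = model.forward(z_a)
```

Calling `model.generate(z_a)` and `model.generate(z_b)` yields different target weights, which is the input-conditional behaviour the abstract refers to: the classifier applied to each recording is produced on the fly rather than fixed after training.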