🤖 AI Summary
Self-supervised speech representations lack interpretability and clinical credibility for Parkinson's disease (PD) diagnosis. Method: We propose a dual-granularity (embedding-level and temporal-level) interpretability framework that uses cross-attention to couple self-supervised representations (e.g., Wav2Vec 2.0, HuBERT) with clinically grounded speech pathology markers, enabling traceable attribution from acoustic features to clinical semantics. Contribution/Results: The framework identifies meaningful speech patterns within self-supervised representations across a wide range of PD assessment tasks, making model decisions more transparent. On five well-established PD speech benchmarks, it achieves classification accuracy competitive with state-of-the-art approaches. It also remains robust in cross-lingual scenarios when applied to spontaneous speech, which supports its applicability in real-world clinical settings. This work offers a step toward trustworthy, interpretable, speech-based computer-assisted PD diagnosis.
📝 Abstract
Recent works in pathological speech analysis have increasingly relied on powerful self-supervised speech representations, leading to promising results. However, the complex, black-box nature of these embeddings and the limited research on their interpretability significantly restrict their adoption for clinical diagnosis. To address this gap, we propose a novel, interpretable framework specifically designed to support Parkinson's Disease (PD) diagnosis. Through the design of simple yet effective cross-attention mechanisms for both embedding- and temporal-level analysis, the proposed framework offers interpretability from two distinct but complementary perspectives. Experimental findings across five well-established speech benchmarks for PD detection demonstrate the framework's capability to identify meaningful speech patterns within self-supervised representations for a wide range of assessment tasks. Fine-grained temporal analyses further underscore its potential to enhance the interpretability of deep-learning pathological speech models, paving the way for the development of more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems in this domain. Moreover, in terms of classification accuracy, our method achieves results competitive with state-of-the-art approaches, while also demonstrating robustness in cross-lingual scenarios when applied to spontaneous speech production.
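To make the temporal-level idea concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention, not the authors' architecture. It assumes a hypothetical query vector standing in for a clinically motivated feature and a short sequence of hypothetical frame embeddings standing in for self-supervised representations; learned projections, multiple heads, and the real feature sets are all omitted. The attention weights form a distribution over frames, which is the kind of quantity a temporal-level analysis could inspect.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query, frames):
    """Scaled dot-product cross-attention with no learned projections.

    query:  vector of length d (assumed: a clinically motivated feature embedding)
    frames: list of T vectors of length d (assumed: per-frame speech embeddings)
    Returns (context vector of length d, attention weights of length T).
    """
    d = len(query)
    scores = [dot(query, f) / math.sqrt(d) for f in frames]
    weights = softmax(scores)
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(d)]
    return context, weights


# Toy, made-up values: three frames with four-dimensional embeddings.
frames = [
    [0.1, 0.0, 0.2, 0.1],
    [0.9, 0.8, 0.1, 0.0],
    [0.2, 0.1, 0.0, 0.3],
]
query = [1.0, 1.0, 0.0, 0.0]

context, weights = cross_attention(query, frames)

# Index of the frame that receives the largest attention weight.
top_frame = max(range(len(weights)), key=weights.__getitem__)
print("attention over frames:", [round(w, 3) for w in weights])
print("most attended frame:", top_frame)
```

In this toy setup the second frame (index 1) is the one most aligned with the query, so it receives the largest weight. Under the same assumptions, an embedding-level view would attend over embedding dimensions rather than time frames.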