CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of early, non-invasive Alzheimer’s disease (AD) screening, where insufficient fine-grained alignment between audio and text modalities limits diagnostic accuracy. We propose a word-level cross-modal temporal alignment framework: (1) leveraging transcription timestamps to precisely synchronize speech embeddings with textual tokens; (2) introducing a gated cross-attention mechanism that uses text as guidance to enhance multimodal interaction; and (3) pioneering pause modeling by incorporating explicit *pause tokens* and generating acoustic representations for silent segments. The model is trained end-to-end on the ADReSSo dataset, achieving 90.36% classification accuracy—significantly surpassing state-of-the-art methods. Ablation studies confirm the individual contributions of word-level alignment, gated fusion, and prosodic (pause) modeling. Our approach establishes a novel, interpretable, and reproducible paradigm for multimodal cognitive impairment detection.
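The word-level synchronization step described above can be pictured as pooling frame-level audio features inside each token's transcription-timestamp span. A minimal numpy sketch; the mean-pooling choice, function name, and frame-hop parameter are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def align_audio_to_words(frames, frame_hop, word_spans):
    """Pool frame-level audio features into one embedding per word.

    frames:     (n_frames, d) array of frame-level audio features
    frame_hop:  seconds between consecutive frames
    word_spans: list of (start_s, end_s) timestamps, one per token
    Returns a (n_words, d) array aligned with the token sequence.
    """
    pooled = []
    for start_s, end_s in word_spans:
        i = int(round(start_s / frame_hop))
        # Ensure every word covers at least one frame.
        j = max(i + 1, int(round(end_s / frame_hop)))
        pooled.append(frames[i:j].mean(axis=0))
    return np.stack(pooled)
```

With aligned embeddings per token, downstream fusion can operate token-by-token instead of over coarse utterance-level vectors.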

📝 Abstract
Early detection of cognitive disorders such as Alzheimer's disease is critical for enabling timely clinical intervention and improving patient outcomes. In this work, we introduce CogniAlign, a multimodal architecture for Alzheimer's detection that integrates audio and textual modalities, two non-intrusive sources of information that offer complementary insights into cognitive health. Unlike prior approaches that fuse modalities at a coarse level, CogniAlign leverages a word-level temporal alignment strategy that synchronizes audio embeddings with corresponding textual tokens based on transcription timestamps. This alignment supports the development of token-level fusion techniques, enabling more precise cross-modal interactions. To fully exploit this alignment, we propose a Gated Cross-Attention Fusion mechanism, where audio features attend over textual representations, guided by the superior unimodal performance of the text modality. In addition, we incorporate prosodic cues, specifically inter-word pauses, by inserting pause tokens into the text and generating audio embeddings for silent intervals, further enriching both streams. We evaluate CogniAlign on the ADReSSo dataset, where it achieves an accuracy of 90.36%, outperforming existing state-of-the-art methods. A detailed ablation study confirms the advantages of our alignment strategy, attention-based fusion, and prosodic modeling.
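The Gated Cross-Attention Fusion the abstract describes, where audio queries attend over text representations and a learned gate controls how much attended context replaces the original audio stream, might look roughly like the following numpy sketch. This is a single-head, sigmoid-gated form under stated assumptions; the weight shapes and gating formula are illustrative, not the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(audio, text, W_q, W_k, W_v, W_g):
    """One plausible gated cross-attention fusion step.

    audio: (T, d) word-aligned audio embeddings (queries)
    text:  (T, d) token embeddings (keys/values), text guides fusion
    W_q, W_k, W_v: (d, d) projection matrices; W_g: (2d, d) gate weights
    Returns fused (T, d) features.
    """
    Q, K, V = audio @ W_q, text @ W_k, text @ W_v
    # Scaled dot-product attention of audio over text tokens.
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    ctx = attn @ V
    # Sigmoid gate computed from the audio features and attended context.
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([audio, ctx], axis=-1) @ W_g)))
    return gate * ctx + (1.0 - gate) * audio
```

Because the alignment step yields one audio embedding per text token, both inputs share the sequence length T, which is what makes this token-level attention well defined.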
Problem

Research questions and friction points this paper is trying to address.

Detects Alzheimer's using aligned audio-text modalities
Improves fusion via word-level cross-modal attention
Enhances accuracy with prosodic pause token integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Word-level multimodal speech alignment strategy
Gated Cross-Attention Fusion mechanism
Prosodic cues integration via pause tokens
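The pause-token contribution listed above can be illustrated with a small helper that scans word-level timestamps and inserts an explicit token wherever the inter-word silence exceeds a threshold. The 0.5 s threshold and the `[PAUSE]` string are illustrative assumptions, not values from the paper:

```python
def insert_pause_tokens(words, threshold=0.5, pause_token="[PAUSE]"):
    """Insert explicit pause tokens into a timestamped transcript.

    words: list of (word, start_s, end_s) tuples from forced alignment
    Returns the token sequence with pause_token marking long silences.
    """
    tokens = []
    prev_end = None
    for word, start_s, end_s in words:
        # A gap longer than the threshold becomes an explicit pause token.
        if prev_end is not None and start_s - prev_end >= threshold:
            tokens.append(pause_token)
        tokens.append(word)
        prev_end = end_s
    return tokens
```

On the audio side, the paper pairs each such token with an acoustic representation of the corresponding silent interval, so both streams stay aligned token-for-token.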
🔎 Similar Papers
2024-01-02 · IEEE International Conference on Bioinformatics and Biomedicine · Citations: 0
David Ortiz-Perez
PhD Student, University of Alicante
Deep Learning · Computer Vision · Multimodal
Manuel Benavent-Lledo
PhD Student, University of Alicante
Deep Learning · Computer Vision · Action Recognition
Javier Rodríguez-Juan
Department of Computer Science and Technology, University of Alicante, Alicante, Spain
Jose Garcia-Rodriguez
Department of Computer Science and Technology, University of Alicante, Alicante, Spain, Valencian Graduate School and Research Network of Artificial Intelligence, Valencia, Spain
David Tomás
Department of Software and Computing Systems, University of Alicante, Alicante, Spain