AI Summary
To address the need for non-invasive, early screening of cognitive decline (e.g., in Alzheimer's disease), this survey systematically reviews deep learning methods that leverage speech, text, and visual modalities. It examines the key features and advantages of each modality and methodology, including state-of-the-art approaches such as Transformer architectures and foundation models, and covers works that integrate multiple modalities into multimodal systems. It also provides a unified overview of the most significant datasets and the quantitative results reported on them. Key observations include: (1) the textual modality generally achieves the best results and is the most relevant for detecting cognitive decline; and (2) combining approaches from individual modalities into a multimodal model consistently improves performance across nearly all scenarios. Collectively, these findings chart a technical pathway toward non-invasive, automated cognitive health monitoring in everyday settings.
Abstract
Cognitive decline is a natural part of aging, often resulting in reduced cognitive abilities. In some cases, however, this decline is more pronounced, typically due to disorders such as Alzheimer's disease. Early detection of anomalous cognitive decline is crucial, as it can facilitate timely professional intervention. While medical data can help in this detection, obtaining it often involves invasive procedures. An alternative approach is to employ non-intrusive techniques such as speech or handwriting analysis, which do not necessarily interfere with daily activities. This survey reviews the most relevant methodologies that use deep learning techniques to automate the cognitive decline estimation task, covering audio, text, and visual processing. We discuss the key features and advantages of each modality and methodology, including state-of-the-art approaches such as Transformer architectures and foundation models. In addition, we present works that integrate different modalities to develop multimodal models. We also highlight the most significant datasets and the quantitative results from studies using these resources. From this review, several conclusions emerge. In most cases, the textual modality achieves the best results and is the most relevant for detecting cognitive decline. Moreover, combining various approaches from individual modalities into a multimodal model consistently enhances performance across nearly all scenarios.