🤖 AI Summary
Existing wearable IMU-based human activity recognition (HAR) methods for movement disorders (e.g., Parkinson’s disease) suffer from poor out-of-distribution (OOD) generalization and heavy reliance on labeled data.
Method: We propose an IMU-video cross-modal self-supervised pretraining framework that leverages large-scale unlabeled multimodal data. It employs cross-modal contrastive learning and spatiotemporal alignment to learn robust, disentangled motion representations—without requiring task-specific annotations.
Contribution/Results: The framework significantly improves generalization to unseen environments and populations. Experiments demonstrate superior zero-shot and few-shot transfer performance on multiple OOD IMU benchmarks, outperforming both IMU-only and existing IMU-video pretraining approaches. By enabling accurate, continuous, and low-burden detection of abnormal movements from remote monitoring data, the framework supports scalable, real-world deployment in digital health.
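The summary above mentions cross-modal contrastive learning between time-aligned IMU and video clips. The paper's exact loss is not given here; the following is a minimal sketch of the standard symmetric InfoNCE objective commonly used for such cross-modal pretraining, where clip `i` from each modality forms the positive pair and other clips in the batch serve as negatives (all names, shapes, and the temperature value are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def info_nce(imu_emb, vid_emb, temperature=0.07):
    """Symmetric InfoNCE between paired IMU and video embeddings.

    imu_emb, vid_emb: (N, D) arrays; row i of each modality is assumed to
    come from the same time-aligned clip (the positive pair), while the
    other N-1 rows in the batch act as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    imu = imu_emb / np.linalg.norm(imu_emb, axis=1, keepdims=True)
    vid = vid_emb / np.linalg.norm(vid_emb, axis=1, keepdims=True)

    logits = imu @ vid.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(logits))          # positives lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Symmetric: IMU->video retrieval plus video->IMU retrieval
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each IMU clip's embedding toward its paired video embedding and pushes it away from the other clips in the batch, which is one common way such frameworks learn representations without task-specific labels.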
📝 Abstract
Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal movements in patients' home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out-of-distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson's disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that for highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at https://github.com/scheshmi/IMU-Video-OOD-HAR.