🤖 AI Summary
Existing methods struggle to accurately recognize sub-second micro-movements while preserving the contextual integrity of rehabilitation exercises, leading to ambiguous action phase boundaries and compromising the reliability of motor function assessment in stroke patients. To address this challenge, this work proposes a high-resolution temporal Transformer architecture that introduces a novel multi-membership temporal attention mechanism, enabling each frame to concurrently attend to multiple local temporal contexts. By integrating feature-space overlap resolution with single-stage multimodal fusion of video and inertial measurement unit (IMU) data, the model significantly enhances boundary sensitivity without increasing architectural depth or relying on post-processing. Experiments on the StrokeRehab and 50Salads datasets demonstrate Edit Score improvements of 1.3–1.6 and 3.3, respectively, and ablation studies confirm that the performance gains stem from multi-membership temporal modeling rather than increased structural complexity.
📝 Abstract
To support the iterative assessments involved in a person's rehabilitation, automated evaluation of their abilities during daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab while further improving 50Salads by +3.3. Ablations confirm that performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
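The abstract describes each frame attending to multiple overlapping, locally normalized attention windows whose outputs are merged by feature-space overlap resolution. The paper's exact formulation is not given here, so the following is only a minimal toy sketch of that idea: a single-head self-attention over sliding windows (stride smaller than window size, so frames belong to several windows), with per-window softmax normalization and simple averaging as a stand-in for the overlap-resolution step. All names and the averaging rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_membership_attention(X, window=8, stride=4):
    """Toy multi-membership temporal attention over frame features X (T, D).

    Each frame falls inside several overlapping windows (stride < window).
    Attention is softmax-normalized *within* each window, and the outputs
    of all windows covering a frame are averaged -- a simple placeholder
    for the paper's feature-space overlap resolution.
    """
    T, D = X.shape
    out = np.zeros((T, D))
    counts = np.zeros((T, 1))
    for start in range(0, T, stride):
        idx = np.arange(start, min(start + window, T))
        Q = K = V = X[idx]                 # shared projections, for brevity
        A = softmax(Q @ K.T / np.sqrt(D))  # locally normalized attention
        out[idx] += A @ V                  # accumulate each window's view
        counts[idx] += 1                   # how many windows cover each frame
    return out / counts                    # overlap resolution: average

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 16))          # 20 frames, 16-dim features
Y = multi_membership_attention(X, window=8, stride=4)
print(Y.shape)  # (20, 16)
```

With `window=8, stride=4`, interior frames receive two competing local contexts per layer, which is the property the abstract credits for sharper action-phase boundaries; a frame near a transition is summarized by both the window ending at the transition and the one beginning there.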