🤖 AI Summary
In collaborative scenarios, temporal action segmentation is highly susceptible to noise from pose estimation and object detection, leading to over-segmentation and temporal discontinuities. To address this, we propose a multimodal temporal graph fusion framework. The method has four components: (1) sinusoidal encoding is introduced to enhance skeletal spatial representation; (2) a temporal graph fusion module aligns low-frame-rate visual features with high-frame-rate motion data; (3) SmoothLabelMix, a novel data augmentation strategy, is proposed to generate synthetic samples with smoothed action boundaries; (4) a multimodal graph convolutional network integrates skeletal, object detection, and visual cues, incorporating hierarchical feature aggregation and temporal consistency constraints. Evaluated on the Bimanual Actions Dataset, our method achieves state-of-the-art performance (F1@10 = 94.5%, F1@25 = 92.8%), significantly mitigating over-segmentation and improving robustness in action boundary localization.
📝 Abstract
Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework makes three key contributions. First, a sinusoidal encoding strategy maps 3D skeleton coordinates into a continuous sin-cos space to enhance the robustness of the spatial representation. Second, a temporal graph fusion module aligns multi-modal inputs with differing temporal resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts.
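To make the first and third contributions concrete, the following is a minimal sketch rather than the authors' implementation. The function names, the `scale` frequency, and the sigmoid-shaped mixing weight (with its `center` and `width` parameters) are all assumptions made for illustration; the abstract states only that coordinates are mapped into a sin-cos space and that sequences and labels are mixed to produce gradual transitions.

```python
import math


def sincos_encode(joint, scale=math.pi):
    """Map one 3D joint (x, y, z) to six values: sin and cos of each coordinate.

    `scale` is a hypothetical frequency; the exact mapping used in the
    paper is not specified in the abstract.
    """
    encoded = []
    for coord in joint:
        encoded.extend((math.sin(scale * coord), math.cos(scale * coord)))
    return encoded


def smooth_label_mix(seq_a, seq_b, lab_a, lab_b, center, width):
    """Blend two equal-length sequences so that A fades into B around `center`.

    seq_a, seq_b: per-frame feature vectors (lists of floats).
    lab_a, lab_b: per-frame one-hot label vectors.
    The same sigmoid weight is applied to inputs and labels (an assumed
    schedule), so the soft labels change gradually across the boundary.
    """
    mixed_x, mixed_y = [], []
    for t, (xa, xb, ya, yb) in enumerate(zip(seq_a, seq_b, lab_a, lab_b)):
        w = 1.0 / (1.0 + math.exp(-(t - center) / width))  # weight of B at frame t
        mixed_x.append([(1 - w) * a + w * b for a, b in zip(xa, xb)])
        mixed_y.append([(1 - w) * a + w * b for a, b in zip(ya, yb)])
    return mixed_x, mixed_y
```

In this sketch each mixed label vector remains a convex combination of two one-hot vectors, so it still sums to one and can serve as a soft target for a cross-entropy loss.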
Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving an F1@10 of 94.5% and an F1@25 of 92.8%.
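For readers unfamiliar with the reported metric, the sketch below shows how a segmental F1@k score is commonly computed in the action segmentation literature. It assumes the paper follows that common definition: a predicted segment counts as a true positive when its intersection-over-union with an unmatched ground-truth segment of the same label is at least k%. The function names are illustrative.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs, end exclusive."""
    runs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            runs.append((labels[start], start, t))
            start = t
    return runs


def f1_at_k(pred, truth, k):
    """Segmental F1 with an IoU threshold of k percent."""
    pred_segs, true_segs = segments(pred), segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (true_label, gs, ge) in enumerate(true_segs):
            if true_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= k / 100 and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1  # no overlap, low IoU, or ground-truth segment already claimed
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = [0] * 10 + [1] * 10
over_segmented = [0] * 4 + [1] + [0] * 5 + [1] * 10  # one spurious frame
print(f1_at_k(truth, truth, 10))           # 1.0
print(f1_at_k(over_segmented, truth, 10))  # about 0.667
```

The second prediction is wrong on only one of 20 frames (95% frame accuracy), yet its F1@10 drops to about 0.667 because the single flipped frame splits one segment into three. This is why segmental F1 is sensitive to the over-segmentation errors the abstract targets.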