Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
In collaborative scenarios, temporal action segmentation is highly susceptible to noise from pose estimation and object detection, leading to over-segmentation and temporal discontinuities. To address this, we propose a multimodal temporal graph fusion framework. Methodologically: (1) sinusoidal encoding is introduced to enhance skeletal spatial representation; (2) a temporal graph fusion module aligns low-frame-rate visual features with high-frame-rate motion data; (3) SmoothLabelMix—a novel data augmentation strategy—is proposed to generate synthetic samples with smoothed action boundaries; (4) a multimodal graph convolutional network integrates skeletal, object detection, and visual cues, incorporating hierarchical feature aggregation and temporal consistency constraints. Evaluated on the Bimanual Actions Dataset, our method achieves state-of-the-art performance (F1@10 = 94.5%, F1@25 = 92.8%), significantly mitigating over-segmentation and improving robustness in action boundary localization.
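The summary's first contribution, sinusoidal encoding of skeletal coordinates, resembles the positional-encoding trick applied to raw 3D joint positions. The paper's exact frequencies and dimensions aren't given here, so the following is a minimal sketch under assumed defaults (a power-of-two frequency ladder, 25 joints, `num_freqs=4`):

```python
import numpy as np

def sinusoidal_encode(coords, num_freqs=4):
    """Map raw 3D joint coordinates into a continuous sin-cos space.

    coords: array of shape (..., 3) holding x, y, z positions.
    Returns an array of shape (..., 3 * 2 * num_freqs), bounded in [-1, 1].
    """
    freqs = 2.0 ** np.arange(num_freqs)      # assumed geometric frequency ladder
    scaled = coords[..., None] * freqs       # (..., 3, num_freqs)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*coords.shape[:-1], -1)

# 30 frames x 25 joints x (x, y, z) -> 30 x 25 x 24 encoded features
joints = np.random.randn(30, 25, 3)
features = sinusoidal_encode(joints)
print(features.shape)  # (30, 25, 24)
```

Encoding coordinates this way bounds the input range and gives the network a smooth, multi-scale view of each joint's position, which plausibly explains the robustness claim against pose-estimation noise.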

📝 Abstract
Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
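The abstract describes SmoothLabelMix as mixing input sequences and labels to synthesize training examples with gradual action transitions. The paper's exact formulation isn't reproduced here; a plausible sketch uses a temporally smooth (sigmoid) mixing weight so that both features and one-hot labels ramp from one sequence to the other instead of switching at a hard cut (the `center` and `width` parameters below are illustrative assumptions):

```python
import numpy as np

def smooth_label_mix(x_a, y_a, x_b, y_b, center=None, width=8.0):
    """Blend two sequences with a temporally smooth mixing weight.

    x_a, x_b: (T, D) feature sequences; y_a, y_b: (T, C) one-hot labels.
    The weight lam(t) rises smoothly from ~0 to ~1 around `center`, so the
    mixed labels are soft across the synthetic action boundary.
    """
    T = x_a.shape[0]
    if center is None:
        center = T // 2
    t = np.arange(T)
    lam = 1.0 / (1.0 + np.exp(-(t - center) / width))   # smooth 0 -> 1 ramp
    x_mix = (1 - lam)[:, None] * x_a + lam[:, None] * x_b
    y_mix = (1 - lam)[:, None] * y_a + lam[:, None] * y_b  # soft labels
    return x_mix, y_mix
```

Training on such soft-boundary samples discourages the model from predicting abrupt label flips, which is consistent with the claimed reduction in over-segmentation.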
Problem

Research questions and friction points this paper is trying to address.

Mitigates over-segmentation errors in human action sequences
Integrates multi-modal data for robust action segmentation
Enhances temporal consistency with smooth action transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Modal Graph Convolutional Network for action segmentation
Sinusoidal encoding enhances spatial representation robustness
SmoothLabelMix reduces over-segmentation via gradual transitions
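The fusion of 1 fps visual features with 30 fps motion data requires aligning the two streams on a common temporal grid before the graph fusion module combines them. The paper's module uses hierarchical feature aggregation; as a minimal baseline illustrating the rate mismatch itself, one can linearly interpolate the low-rate features up to the motion rate (function name and defaults are assumptions, not the authors' API):

```python
import numpy as np

def upsample_features(visual_feats, src_fps=1, dst_fps=30):
    """Linearly interpolate low-rate visual features to the motion-stream rate.

    visual_feats: (T_low, D) features sampled at src_fps.
    Returns (T_low * dst_fps // src_fps, D) features at dst_fps.
    """
    T_low, D = visual_feats.shape
    ratio = dst_fps // src_fps
    T_high = T_low * ratio
    src_t = np.arange(T_low) * ratio    # source timestamps on the high-rate grid
    dst_t = np.arange(T_high)
    out = np.empty((T_high, D))
    for d in range(D):
        # np.interp holds the last value past the final source sample
        out[:, d] = np.interp(dst_t, src_t, visual_feats[:, d])
    return out
```

After this step, the visual and skeletal streams share one frame index, so per-frame fusion (concatenation or graph-based aggregation) becomes straightforward.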
Hao Xing
Institute for Cognitive Systems, Technical University of Munich
Kai Zhe Boey
Institute for Cognitive Systems, Technical University of Munich
Yuankai Wu
Chair of Media Technology, Technical University of Munich
Darius Burschka
Professor of Computer Engineering, Technical University of Munich
Image Processing · Computer Vision · Structure from Motion · Human-Computer Interaction · 3D
Gordon Cheng
Technical University of Munich
NeuroRobotics · NeuroEngineering · Imitation Learning · Cognitive Systems · Humanoid Robotics