AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of 3D human pose estimation from egocentric views, where severe perspective distortion, limited body visibility, and complex camera motion hinder performance. To tackle these issues, the authors propose a dual-stream Transformer framework: a spatial stream based on ResNet-18 generates 2D joint heatmaps and learnable joint tokens, while a temporal stream leveraging ResNet-50 and action recognition models captures both short- and long-term motion dynamics. These streams are jointly optimized within a Transformer decoder, enabling, for the first time, unified encoding of action-guided motion modeling and joint-level spatial features. This integration effectively fuses spatiotemporal information while preserving anatomical constraints. The method achieves state-of-the-art performance on real-world fisheye video datasets, significantly outperforming existing approaches.
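The spatial stream described above produces per-joint 2D heatmaps. A common way to read joint coordinates out of such heatmaps is a differentiable soft-argmax (a spatial expectation over a softmax-normalized map); the summary does not specify how AG-EgoPose decodes its heatmaps, so the sketch below is an illustrative assumption rather than the authors' implementation:

```python
import numpy as np

def soft_argmax_2d(heatmaps, beta=100.0):
    """Differentiable 2D joint localization from per-joint heatmaps.

    heatmaps: (J, H, W) array of unnormalized scores, one map per joint.
    beta: sharpening temperature for the softmax (assumed hyperparameter).
    Returns a (J, 2) array of (x, y) coordinates in pixel units.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1) * beta
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    probs = probs.reshape(J, H, W)
    xs = np.arange(W, dtype=np.float64)
    ys = np.arange(H, dtype=np.float64)
    x = (probs.sum(axis=1) * xs).sum(axis=1)        # expectation over columns
    y = (probs.sum(axis=2) * ys).sum(axis=1)        # expectation over rows
    return np.stack([x, y], axis=1)

# Toy check: a single sharp peak should be recovered at its location.
hm = np.zeros((1, 64, 64))
hm[0, 20, 45] = 10.0                                # peak at (x=45, y=20)
print(soft_argmax_2d(hm)[0])                        # ≈ [45. 20.]
```

Because the expectation is differentiable, such a decoding step lets 2D localization error backpropagate into the heatmap encoder, which is one reason heatmap-plus-soft-argmax pipelines are popular in pose estimation.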

📝 Abstract
Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion, which fails to effectively leverage the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: a spatial stream uses a weight-sharing ResNet-18 encoder-decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens, while a temporal stream uses a ResNet-50 backbone to extract visual features, which are then processed by an action recognition backbone to capture motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, which allows joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative evaluations. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.
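The fusion stage in the abstract, where learnable joint tokens query spatial and temporal evidence in a transformer decoder, can be sketched as a single cross-attention layer in which J joint queries attend over the concatenated spatial and temporal tokens. The shapes, the single-head single-layer form, and the variable names below are assumptions for illustration; the actual AG-EgoPose decoder is not specified at this level of detail:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_token_decoder(spatial_tokens, temporal_tokens, joint_queries, Wq, Wk, Wv):
    """One cross-attention step: J learnable joint queries attend over the
    concatenated spatial and temporal feature tokens, returning J fused tokens."""
    memory = np.concatenate([spatial_tokens, temporal_tokens], axis=0)  # (S+T, D)
    Q = joint_queries @ Wq                                  # (J, D)
    K = memory @ Wk                                         # (S+T, D)
    V = memory @ Wv                                         # (S+T, D)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=-1)  # (J, S+T) weights
    return attn @ V                                         # (J, D) fused tokens

# Hypothetical sizes: 15 body joints, 16 spatial and 8 temporal tokens, dim 32.
D, J, S, T = 32, 15, 16, 8
spatial  = rng.standard_normal((S, D))   # joint-specific spatial features
temporal = rng.standard_normal((T, D))   # action-guided motion features
queries  = rng.standard_normal((J, D))   # learnable joint tokens
Wq, Wk, Wv = (rng.standard_normal((D, D)) * D**-0.5 for _ in range(3))
fused = joint_token_decoder(spatial, temporal, queries, Wq, Wk, Wv)
print(fused.shape)                       # (15, 32): one fused token per joint
```

Keeping one query token per joint is what allows "joint-level integration": each joint gathers its own mixture of spatial and motion evidence, and a shared regression head can then map the J fused tokens to 3D joint positions.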
Problem

Research questions and friction points this paper is trying to address.

egocentric 3D pose estimation
perspective distortion
body visibility
camera motion
motion context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Egocentric Pose Estimation
Action-Guided Motion
Kinematic Joint Encoding
Dual-Stream Framework
Transformer-based Fusion