Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation

📅 2025-06-03
🏛️ IEEE Transactions on Multimedia
📈 Citations: 2
Influential: 0
🤖 AI Summary
Modeling long-range dependencies in 3D human pose estimation remains challenging due to noise susceptibility and high model complexity. To address this, we propose the Pyramid Graph Attention (PGA) module and the Pyramid Graph Transformer (PGFormer), a lightweight multi-scale graph transformer. Our core contribution is the first formulation of human anatomical sub-structures (joints, limbs, and torso) as a pyramid-shaped, cross-scale graph, coupled with a pooling-augmented self-attention mechanism that preserves structural priors while enabling multi-granularity feature interaction. By integrating graph convolutional operations with multi-scale feature fusion, our method effectively suppresses redundancy in deep networks. Evaluated on Human3.6M and MPI-INF-3DHP, it achieves state-of-the-art accuracy (MPJPE of 41.2 mm and 89.7 mm, respectively) with a 23% reduction in parameter count, demonstrating both the efficacy and efficiency of cross-scale graph modeling for 3D pose estimation.

📝 Abstract
Action coordination in the human structure is indispensable for the spatial constraints that allow 3D pose to be recovered from 2D joints. Usually, action coordination is represented as a long-range dependency among body parts. However, modeling long-range dependencies poses two main challenges. First, joints should be constrained not only by other individual joints but also modulated by body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts, which introduces uncorrelated noise and increases model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It captures correlations across joints and groups, complementing the context of human sub-structures, and models pyramid-structured long-range dependencies in an effective cross-scale way. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies: it concatenates information from multiple scales into a compact sequence, then computes correlations between scales in parallel. Combining PGA with graph convolution modules, we develop the Pyramid Graph Transformer (PGFormer), a lightweight multi-scale transformer architecture for 3D human pose estimation that encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and a smaller model size than state-of-the-art methods on the Human3.6M and MPI-INF-3DHP datasets. The code is available at https://github.com/MingjieWe/PGFormer.
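The abstract describes PGA as pooling joints into sub-structure tokens, concatenating the scales into one compact sequence, and running self-attention across that sequence so every token can attend across scales. A rough NumPy sketch of that cross-scale idea follows; the joint grouping, feature sizes, and random projections are illustrative assumptions, not the paper's actual configuration or learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical pooling map: 16 joints -> 5 body parts (four limbs + torso).
PART_GROUPS = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14, 15]]

def pyramid_attention(joints, d=32, seed=0):
    """Self-attention over a concatenated joint/part/body pyramid sequence."""
    n, f = joints.shape
    parts = np.stack([joints[g].mean(axis=0) for g in PART_GROUPS])  # (5, f)
    body = joints.mean(axis=0, keepdims=True)                        # (1, f)
    seq = np.concatenate([joints, parts, body], axis=0)              # (n+5+1, f)

    rng = np.random.default_rng(seed)  # stand-in for learned projections
    Wq, Wk, Wv = (rng.standard_normal((f, d)) for _ in range(3))
    Q, K, V = seq @ Wq, seq @ Wk, seq @ Wv

    attn = softmax(Q @ K.T / np.sqrt(d))  # each token attends across all scales
    out = attn @ V
    return out[:n]                        # keep the refined joint-level tokens

x = np.random.default_rng(1).standard_normal((16, 32))
y = pyramid_attention(x)
print(y.shape)  # (16, 32)
```

Pooling keeps the coarse tokens tied to anatomical structure, so the attention matrix stays small (22x22 here) while joint-to-part and joint-to-body correlations are still computed in a single pass.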
Problem

Research questions and friction points this paper is trying to address.

Modeling long-range dependencies in 3D human pose estimation
Reducing uncorrelated noise and model size in deep networks
Capturing cross-scale correlations for human sub-structure context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pyramid structure captures long-range dependencies
Pyramid Graph Attention computes cross-scale correlations
Lightweight multi-scale transformer with self-attention pooling
Mingjie Wei
Xidian University
3D human motion generation; 3D human pose estimation
Xuemei Xie
School of Artificial Intelligence, Xidian University, and also with Pazhou LAB (Huangpu), Guangzhou 510555, China
Yutong Zhong
School of Artificial Intelligence, Xidian University, Xi’an 710071, China
G. Shi
School of Artificial Intelligence, Xidian University, and also with Peng Cheng Laboratory, Shenzhen 518066, China