Optimizing Local-Global Dependencies for Accurate 3D Human Pose Estimation

📅 2024-12-27

🤖 AI Summary
To address the insufficient modeling of fine-grained local detail in Transformer-based 3D human pose estimation, this paper proposes SSR-STF, a dual-stream architecture that jointly models local skeletal structure and long-range inter-joint dependencies. Its core component is SSRFormer, a module built on the skeleton selective refine attention (SSRA) mechanism, which captures fine-grained local dependencies in pose sequences and complements the global dependencies modeled by the Transformer, something a purely global attention stream handles poorly. The method integrates a Transformer backbone, the SSRFormer module, multi-scale spatiotemporal feature alignment, and adaptive fusion of the two feature streams. On Human3.6M and MPI-INF-3DHP, the authors report state-of-the-art P1 errors of 37.4 mm and 13.2 mm, respectively. The learned motion representations also prove effective in downstream tasks such as human mesh recovery.
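For readers unfamiliar with the metric: in the 3D pose literature, P1 (Protocol 1) conventionally denotes the mean per-joint position error (MPJPE), the average Euclidean distance between predicted and ground-truth joints, usually computed after aligning the root joint. The following is a minimal NumPy sketch of that metric, not the authors' evaluation code; the function name `mpjpe`, the array layout `(frames, joints, 3)`, and the 17-joint toy data are assumptions made for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs (e.g. mm).

    pred, gt: arrays of shape (frames, joints, 3). Root alignment, if the
    protocol requires it, is assumed to have been applied beforehand.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Euclidean distance per joint, then the mean over all frames and joints.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data (hypothetical): every joint is offset by (3, 4, 0),
# so each per-joint error is exactly 5.
gt = np.zeros((2, 17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
error = mpjpe(pred, gt)
```

Under this definition, lower is better, and a value such as 37.4 mm means the predicted joints lie on average 37.4 mm from their ground-truth positions.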

📝 Abstract
Transformer-based methods have recently achieved significant success in 3D human pose estimation, owing to their strong ability to model long-range dependencies. However, relying solely on the global attention mechanism is insufficient for capturing fine-grained local details, which are crucial for accurate pose estimation. To address this, we propose SSR-STF, a dual-stream model that effectively integrates local features with global dependencies to enhance 3D human pose estimation. Specifically, we introduce SSRFormer, a simple yet effective module that employs the skeleton selective refine attention (SSRA) mechanism to capture fine-grained local dependencies in human pose sequences, complementing the global dependencies modeled by the Transformer. By adaptively fusing these two feature streams, SSR-STF can better learn the underlying structure of human poses, overcoming the limitations of traditional methods in local feature extraction. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that SSR-STF achieves state-of-the-art performance, with P1 errors of 37.4 mm and 13.2 mm, respectively, outperforming existing methods in both accuracy and generalization. Furthermore, the motion representations learned by our model prove effective in downstream tasks such as human mesh recovery. Code is available at https://github.com/poker-xu/SSR-STF.
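The abstract does not spell out how the two feature streams are fused adaptively. As a rough illustration of the general idea only, the sketch below uses a learned softmax gate that produces per-joint weights for a local and a global feature tensor. This gate is a generic technique swapped in for illustration, not the SSR-STF fusion itself (see the linked repository for that); the names `adaptive_fuse` and `w_gate` and all tensor shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fuse(local_feat, global_feat, w_gate):
    """Blend two streams with per-joint softmax weights (hypothetical gate).

    local_feat, global_feat: (frames, joints, channels)
    w_gate: (2 * channels, 2) projection that scores the two streams
    """
    both = np.concatenate([local_feat, global_feat], axis=-1)  # (T, J, 2C)
    alpha = softmax(both @ w_gate, axis=-1)                    # (T, J, 2)
    # The two weights sum to 1 per joint, so this is a convex combination.
    return alpha[..., :1] * local_feat + alpha[..., 1:] * global_feat

# Toy inputs (hypothetical sizes): 4 frames, 17 joints, 8 channels.
rng = np.random.default_rng(0)
T, J, C = 4, 17, 8
local_feat = rng.normal(size=(T, J, C))
global_feat = rng.normal(size=(T, J, C))
w_gate = rng.normal(size=(2 * C, 2))
fused = adaptive_fuse(local_feat, global_feat, w_gate)
```

With an all-zero `w_gate` the gate reduces to a plain average of the two streams; in a trained model the projection would be learned so that each joint can lean on whichever stream is more informative.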
Problem

Research questions and friction points this paper is trying to address.

3D Human Pose Estimation
Accuracy Improvement
Fine-Grained Local Detail Capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

SSR-STF Model
Dual-Stream Approach
SSRFormer Module
Authors
Guangsheng Xu
School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, Guangdong, China
Guoyi Zhang
School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, Guangdong, China
Lejia Ye
School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, Guangdong, China
Shuwei Gan
School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, Guangdong, China
Xiaohu Zhang
The University of Hong Kong
Urban Technology, Transport Geography
Xia Yang
Professor, Integrative Biology and Physiology, Molecular and Medical Pharmacology, UCLA
Integrative multiomics, systems biology, complex diseases, cardiometabolic diseases, brain disorders