SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unsupervised multi-view 3D human pose estimation methods suffer from poor generalization and sensitivity to occlusion. This paper proposes the first ground-truth-free framework based on 3D Gaussian splatting: the human body is modeled as a differentiable, joint-level Gaussian point cloud in which each Gaussian is optimized independently via a one-hot encoding scheme, and differentiable rendering is combined with multi-view geometric constraints to reconstruct pose across views without any 3D annotations. To the authors' knowledge, this is the first work to introduce Gaussian splatting into skeletal pose estimation; the formulation naturally supports arbitrary camera configurations and substantially improves cross-dataset generalization. Evaluated on Human3.6M and CMU Panoptic, the method reduces cross-dataset error by up to 47.8% compared to learning-based methods, while remaining accurate under severe occlusion.
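Because each joint is rendered into its own channel, the rendering loss decouples across joints. The following is a minimal 2D sketch of that one-hot idea, not the paper's actual 3D Gaussian splatting renderer; the grid size, sigma, and joint positions are made up for illustration:

```python
import numpy as np

def splat(center, sigma, size=32):
    """Render one isotropic 2D Gaussian blob on a size x size grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Two joints, each splatted into its own (one-hot) channel.
joints = np.array([[10.0, 12.0], [22.0, 8.0]])
rendered = np.stack([splat(j, sigma=2.0) for j in joints])  # shape (2, 32, 32)

# Per-channel MSE against target heatmaps: channel k depends only on joint k,
# so gradients from one joint's loss never interfere with another joint.
target_joints = [[11.0, 12.0], [22.0, 8.0]]  # joint 0 is off by one pixel
targets = np.stack([splat(c, sigma=2.0) for c in target_joints])
loss_per_joint = ((rendered - targets) ** 2).mean(axis=(1, 2))
```

Here `loss_per_joint[1]` is exactly zero (joint 1 already matches its target) while `loss_per_joint[0]` is not, showing that the one-hot channels let each Gaussian be optimized independently.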

📝 Abstract
Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth on Human3.6M and CMU, while reducing the cross-dataset error by up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: https://skelsplat.github.io.
Problem

Research questions and friction points this paper is trying to address.

Overcoming poor generalization in multi-view 3D pose estimation
Enabling occlusion-robust 3D human pose reconstruction without 3D supervision
Achieving cross-dataset generalization without scenario-specific fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable Gaussian rendering for pose estimation
One-hot encoding for independent joint optimization
No 3D ground-truth supervision required
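Setting the rendering machinery aside, the core fusion objective behind these contributions is to recover a 3D joint purely from its 2D observations across views, with no 3D ground truth. The sketch below minimizes multi-view reprojection error with Gauss-Newton as a simplified stand-in for the paper's differentiable-rendering optimization; the two-camera rig and all numeric values are hypothetical:

```python
import numpy as np

def project(P, X):
    """Pinhole projection of a 3D point X with a 3x4 camera matrix P."""
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

def residual_and_jacobian(P, X, target):
    """Reprojection residual and its 2x3 Jacobian w.r.t. X."""
    h = P @ np.append(X, 1.0)
    a, b, c = h
    r = np.array([a / c, b / c]) - target
    # d(a/c)/dX = (c * dA/dX - a * dc/dX) / c^2, with dA/dX = P[0, :3] etc.
    J = np.vstack([(c * P[0, :3] - a * P[2, :3]) / c ** 2,
                   (c * P[1, :3] - b * P[2, :3]) / c ** 2])
    return r, J

# Toy two-camera rig (hypothetical intrinsics/extrinsics, not from the paper).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

gt = np.array([0.1, -0.2, 3.0])            # used only to synthesize 2D targets
targets = [project(P1, gt), project(P2, gt)]

# Gauss-Newton refinement of the 3D joint from 2D observations alone.
X = np.array([0.0, 0.0, 2.0])              # rough initial guess
for _ in range(20):
    rs, Js = zip(*(residual_and_jacobian(P, X, t)
                   for P, t in zip([P1, P2], targets)))
    r, J = np.concatenate(rs), np.vstack(Js)
    X -= np.linalg.lstsq(J, r, rcond=None)[0]
# X converges to the point whose projections match both views.
```

In SkelSplat this per-joint objective is expressed through differentiable Gaussian rendering rather than explicit reprojection residuals, which is what allows arbitrary numbers and placements of cameras to be fused seamlessly.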