HART: Human Aligned Reconstruction Transformer

πŸ“… 2025-09-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses 3D human reconstruction from sparse, uncalibrated RGB images. We propose the first unified framework integrating parametric human modeling (SMPL-X) with non-rigid clothing deformation modeling. Our method employs a Transformer architecture to predict pixel-wise 3D point maps, surface normals, and body-part semantic correspondences. To recover self-occluded geometry, we incorporate occlusion-aware Poisson surface reconstruction and jointly optimize a watertight clothed mesh, an SMPL-X-aligned body model, and a Gaussian-splatting rendering representation. To our knowledge, this is the first work to employ Transformers for end-to-end human alignment and dense geometric reconstruction. Extensive experiments demonstrate significant improvements across multiple benchmarks: Chamfer distance of clothed meshes decreases by 18–23%, SMPL-X PA-V2V error drops by 6–27%, and novel-view synthesis LPIPS improves by 15–27%.

πŸ“ Abstract
We introduce HART, a unified framework for sparse-view human reconstruction. Given a small set of uncalibrated RGB images of a person as input, it outputs a watertight clothed mesh, the aligned SMPL-X body mesh, and a Gaussian-splat representation for photorealistic novel-view rendering. Prior methods for clothed human reconstruction either optimize parametric templates, which overlook loose garments and human-object interactions, or train implicit functions under simplified camera assumptions, limiting applicability in real scenes. In contrast, HART predicts per-pixel 3D point maps, normals, and body correspondences, and employs an occlusion-aware Poisson reconstruction to recover complete geometry, even in self-occluded regions. These predictions also align with a parametric SMPL-X body model, ensuring that reconstructed geometry remains consistent with human structure while capturing loose clothing and interactions. These human-aligned meshes initialize Gaussian splats to further enable sparse-view rendering. While trained on only 2.3K synthetic scans, HART achieves state-of-the-art results: Chamfer Distance improves by 18–23 percent for clothed-mesh reconstruction, PA-V2V drops by 6–27 percent for SMPL-X estimation, and LPIPS decreases by 15–27 percent for novel-view synthesis across a wide range of datasets. These results suggest that feed-forward transformers can serve as a scalable model for robust human reconstruction in real-world settings. Code and models will be released.
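PA-V2V is not defined on this page; a common reading in the human-pose literature is Procrustes-aligned vertex-to-vertex error: rigidly align the predicted SMPL-X vertices to the ground truth, then average the per-vertex Euclidean distances. A minimal NumPy sketch under that assumption (rotation and translation only; some definitions also solve for scale, and HART's exact protocol may differ):

```python
import numpy as np

def pa_v2v(pred, gt):
    """Procrustes-aligned vertex-to-vertex error (assumed definition).

    pred, gt: (V, 3) vertex arrays in correspondence. Rigidly aligns
    pred to gt via the Kabsch/SVD solution, then returns the mean
    per-vertex Euclidean distance.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    a, b = pred - mu_p, gt - mu_g
    u, _, vt = np.linalg.svd(a.T @ b)
    # Reflection guard: keep det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    aligned = a @ r.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()

# A rotated and translated copy aligns back exactly, so the error is
# ~0 up to numerical precision.
rng = np.random.default_rng(1)
gt = rng.normal(size=(50, 3))
c, s = np.cos(0.3), np.sin(0.3)
rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred = gt @ rot.T + np.array([0.5, -1.0, 2.0])
print(pa_v2v(pred, gt))
```

Because the alignment step removes any global rigid offset, PA-V2V isolates shape and pose error from camera or root-frame error, which is why it is the standard metric for body-model estimation.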
Problem

Research questions and friction points this paper is trying to address.

Reconstructing clothed human geometry from sparse, uncalibrated RGB images
Aligning the reconstructed mesh with a parametric body model for structural consistency
Enabling photorealistic novel-view rendering through a Gaussian-splat representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts per-pixel 3D point maps and surface normals
Uses occlusion-aware Poisson reconstruction to recover complete geometry
Aligns the reconstruction with a parametric SMPL-X body model
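The mesh-quality gains above are reported in Chamfer distance, which measures how far each point of one surface sample lies from its nearest neighbor on the other, in both directions. A minimal NumPy sketch of one common symmetric variant (mean of squared nearest-neighbor distances; the paper's exact formulation, e.g. squared vs. unsquared, may differ):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two point sets.

    p: (N, 3) array, q: (M, 3) array. Averages squared
    nearest-neighbor distances in both directions.
    """
    # Pairwise squared distances via broadcasting, shape (N, M).
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Identical point clouds have zero Chamfer distance.
pts = np.random.default_rng(0).normal(size=(100, 3))
print(chamfer_distance(pts, pts))  # 0.0
```

The O(N·M) pairwise matrix is fine for small samples; evaluation code for dense scan comparisons typically swaps it for a KD-tree nearest-neighbor query.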
πŸ‘₯ Authors
Xiyi Chen — University of Maryland, College Park
Shaofei Wang — State Key Laboratory of General Artificial Intelligence, BIGAI
Marko Mihajlovic — PhD Student, ETH Zurich (Machine Learning, Computer Vision, Computer Graphics)
Taewon Kang — University of Maryland, College Park
Sergey Prokudin — ETH ZΓΌrich (Computer Vision, Machine Learning)
Ming Lin — University of Maryland, College Park