🤖 AI Summary
Existing 3D human reconstruction methods suffer from scarce training assets, ambiguous mapping (especially under occlusion and in invisible regions), error propagation across cascaded stages, and bottlenecks in both reconstruction quality and computational efficiency. This paper introduces a latent-space Gaussian generative paradigm: multi-view images are compressed into Gaussian representations; a UV-structured VAE jointly encodes geometry and texture into a unified latent space; and a DiT-based architecture enables conditional end-to-end generation, framing the ill-posed low-to-high-dimensional mapping as a learnable distribution shift. Key contributions include: (1) the first latent-space Gaussian generation framework for 3D human reconstruction; (2) HGS-1M, the first million-scale 3D human Gaussian asset dataset; and (3) high-fidelity reconstruction that accurately captures facial details, complex textures, and dynamic deformations of loose clothing, while supporting real-time inference and scalable training.
📝 Abstract
3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into two paradigms: optimization-based methods and feed-forward methods (both single-view regression and multi-view generation with reconstruction). However, these are limited, respectively, by slow speed, low quality, cascaded inference, and the ambiguity of mapping low-dimensional planes to high-dimensional space under occlusion and invisibility. Furthermore, existing 3D human assets remain small in scale, insufficient for large-scale training. To address these challenges, we propose a latent-space generation paradigm for 3D human digitization: by compressing multi-view images into Gaussians via a UV-structured VAE and performing DiT-based conditional generation, we transform the ill-posed low-to-high-dimensional mapping into a learnable distribution shift, which also supports end-to-end inference. In addition, we combine a multi-view optimization approach with synthetic data to construct HGS-1M, a dataset of $1$ million 3D Gaussian assets that supports large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose-clothing deformation.
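To make the data flow of the paradigm concrete, here is a minimal shape-level sketch: multi-view images are encoded into a UV-structured latent, a DiT-style model performs conditional denoising in that latent space, and a decoder maps the latent to per-texel Gaussian parameters. All module internals, dimensions, and names (`vae_encode`, `dit_denoise`, `vae_decode`, the latent/Gaussian sizes) are illustrative assumptions, not the paper's actual implementation; only the end-to-end flow is shown.

```python
import numpy as np

V, H, W = 4, 256, 256   # number of input views and image resolution (assumed)
UV, C_LAT = 32, 8       # UV latent resolution and channel count (assumed)
N_GAUSS = UV * UV       # one Gaussian per UV texel in this sketch
G_DIM = 14              # xyz(3) + scale(3) + quaternion(4) + opacity(1) + RGB(3)

def vae_encode(views: np.ndarray) -> np.ndarray:
    """Stand-in for the UV-structured VAE encoder: images -> unified latent."""
    assert views.shape == (V, H, W, 3)
    rng = np.random.default_rng(0)
    return rng.standard_normal((C_LAT, UV, UV))

def dit_denoise(latent: np.ndarray, cond: np.ndarray, steps: int = 4) -> np.ndarray:
    """Stand-in for DiT-based conditional generation: iterative denoising in latent space."""
    x = latent
    for _ in range(steps):
        x = 0.9 * x + 0.1 * cond  # placeholder update toward the conditioning signal
    return x

def vae_decode(latent: np.ndarray) -> np.ndarray:
    """Stand-in for the decoder mapping the UV latent to per-texel Gaussian parameters."""
    flat = latent.reshape(C_LAT, -1).T        # (UV*UV, C_LAT)
    proj = np.ones((C_LAT, G_DIM)) / C_LAT    # placeholder linear head
    return flat @ proj                        # (N_GAUSS, G_DIM)

views = np.zeros((V, H, W, 3))                # dummy multi-view input
cond = vae_encode(views)                      # condition on the input views
noise = np.random.default_rng(1).standard_normal((C_LAT, UV, UV))
gaussians = vae_decode(dit_denoise(noise, cond))
print(gaussians.shape)                        # (1024, 14)
```

Because generation happens once in the unified latent space rather than through cascaded per-stage predictions, a single conditional pass replaces the multi-view-then-reconstruct pipeline, which is what enables end-to-end inference.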