SIGMAN:Scaling 3D Human Gaussian Generation with Millions of Assets

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D human reconstruction methods suffer from scarce training assets, ambiguous mapping—especially under occlusion and in invisible regions—error propagation across cascaded stages, and bottlenecks in both reconstruction quality and computational efficiency. This paper introduces an implicit-space Gaussian generative paradigm: multi-view images are compressed into Gaussian representations; a UV-structured VAE jointly encodes geometry and texture into a unified latent space; and a DiT-based architecture enables conditional end-to-end generation, framing the ill-posed low-dimensional-to-high-dimensional mapping as learnable distributional translation. Key contributions include: (1) the first implicit-space Gaussian generation framework for 3D human reconstruction; (2) HGS-1M—the first million-scale 3D human Gaussian asset dataset; and (3) high-fidelity reconstruction capabilities, accurately capturing facial details, complex textures, and dynamic deformations of loose clothing, while supporting real-time inference and scalable training.

📝 Abstract
3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain constrained by current paradigms and by the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based, and feed-forward (both single-view regression and multi-view generation with reconstruction). These are limited, respectively, by slow speed, low quality, cascaded reasoning, and the ambiguity of mapping low-dimensional planes to high-dimensional space under occlusion and invisibility. Furthermore, existing 3D human assets remain small in scale, insufficient for large-scale training. To address these challenges, we propose a latent-space generation paradigm for 3D human digitization: by compressing multi-view images into Gaussians via a UV-structured VAE and applying DiT-based conditional generation, we transform the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, which also supports end-to-end inference. In addition, we combine a multi-view optimization approach with synthetic data to construct HGS-1M, a dataset containing 1 million 3D Gaussian assets, to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose-clothing deformation.
Problem

Research questions and friction points this paper is trying to address.

Overcoming slow speed and low quality in 3D human digitization
Addressing scarcity of large-scale 3D human training assets
Transforming ill-posed low-to-high-dimensional mapping into learnable distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

UV-structured VAE compresses images into Gaussians
DiT-based conditional generation enables distribution shift
HGS-1M dataset with 1M assets for training
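The pipeline described above (multi-view images encoded into a unified UV-structured latent, DiT-based conditional refinement, then decoding to per-texel 3D Gaussians) can be sketched with toy stand-ins. This is a minimal shape-level sketch, not the paper's method: all dimensions (4 views, a 64×64 UV grid, 14 Gaussian parameters, 8 latent channels), the function names, and the random-projection "networks" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 4 views, a 64x64 UV grid, and
# 14 parameters per Gaussian (3 position, 4 rotation, 3 scale, 1 opacity, 3 color).
N_VIEWS, UV_RES, GAUSS_DIM, LATENT_DIM = 4, 64, 14, 8

def encode_uv_vae(multi_view_images):
    """Stand-in for the UV-structured VAE encoder: maps multi-view images to a
    unified latent grid over the body's UV parameterization (random projection here)."""
    pooled = multi_view_images.mean(axis=0)            # (UV_RES, UV_RES, 3)
    proj = rng.standard_normal((3, LATENT_DIM)) * 0.1
    return pooled @ proj                               # (UV_RES, UV_RES, LATENT_DIM)

def dit_denoise(latent, cond, steps=4):
    """Stand-in for DiT-based conditional generation: iteratively pulls a noisy
    latent toward the conditioning signal (toy 'learnable distribution shift')."""
    for _ in range(steps):
        latent = latent + 0.1 * (cond - latent)
    return latent

def decode_gaussians(latent):
    """Stand-in decoder: one 3D Gaussian per UV texel."""
    proj = rng.standard_normal((LATENT_DIM, GAUSS_DIM)) * 0.1
    return (latent @ proj).reshape(-1, GAUSS_DIM)      # (UV_RES*UV_RES, GAUSS_DIM)

images = rng.standard_normal((N_VIEWS, UV_RES, UV_RES, 3))
cond = encode_uv_vae(images)           # condition derived from the input views
z0 = rng.standard_normal(cond.shape)   # noise in the unified latent space
gaussians = decode_gaussians(dit_denoise(z0, cond))
print(gaussians.shape)                 # one Gaussian per UV texel: (4096, 14)
```

The UV parameterization is what makes end-to-end generation tractable here: it gives the Gaussian set a fixed image-like layout, so standard VAE and DiT architectures can operate on it directly instead of on an unordered point set.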