MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Single-image 3D human reconstruction suffers from geometric distortions (e.g., flattening), multi-view inconsistency, and poor generalization. To address these challenges, we propose the first multi-view diffusion model that jointly incorporates 3D human geometric and structural priors to synthesize geometrically consistent multi-view images. Our method introduces a camera-pose alignment module, enabling joint optimization of 3D Gaussian splatting and camera parameters, and a depth-guided facial distortion mitigation mechanism that refines generated facial regions. Evaluated on the THuman2.0 and 2K2K benchmarks, our approach achieves state-of-the-art performance: it significantly suppresses artifacts, enables high-fidelity novel-view synthesis, and improves reconstruction fidelity and cross-scene generalization.

📝 Abstract
3D human reconstruction from a single image is a challenging problem and has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating one back-view image to facilitate reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothed results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the reconstruction. Finally, leveraging the refined multi-view images along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on the THuman2.0 and 2K2K datasets show that MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.
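For context, the Score Distillation Sampling objective mentioned above (introduced by prior work the abstract contrasts with, not a contribution of this paper) distills a pretrained diffusion denoiser $\hat{\epsilon}_\phi$ into a differentiable 3D representation with parameters $\theta$ via the gradient:

```latex
\nabla_\theta \mathcal{L}_{\text{SDS}}
= \mathbb{E}_{t,\epsilon}\!\left[
    w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,
    \frac{\partial x}{\partial \theta}
  \right]
```

where $x$ is a rendering of the 3D representation, $x_t$ its noised version at diffusion timestep $t$, $y$ the conditioning signal (e.g., the reference image), $\epsilon$ the injected Gaussian noise, and $w(t)$ a timestep-dependent weighting. MVD-HuGaS instead reconstructs from explicitly generated multi-view images, avoiding the inconsistent per-view guidance that makes SDS prone to over-smoothing.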
Problem

Research questions and friction points this paper is trying to address.

3D human reconstruction from a single image is ill-posed and challenging
Artifacts such as flattened human structure and over-smoothing caused by inconsistent multi-view priors
Poor generalization to real-world, in-the-wild images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view diffusion model fine-tuned on 3D human data to incorporate geometry and structure priors
Alignment module for joint optimization of 3D Gaussians and camera poses
Depth-based Facial Distortion Mitigation module to refine generated facial regions
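The idea behind the second bullet, jointly optimizing geometry and camera poses against fixed image observations, can be illustrated with a deliberately simplified toy: an orthographic camera, a translation-only pose, and plain gradient descent. This is a minimal sketch under those assumptions, not the paper's method or code; all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

points_true = rng.normal(size=(50, 3))   # ground-truth 3D points
t_true = np.array([0.3, -0.2, 0.0])      # ground-truth camera translation

def project(points, t):
    """Orthographic projection after translating into the camera frame."""
    return (points + t)[:, :2]

# Fixed 2D observations, standing in for the generated multi-view images.
obs = project(points_true, t_true)

def loss(points, t):
    """Sum of squared reprojection residuals."""
    return float(np.sum((project(points, t) - obs) ** 2))

# Start from noisy estimates of BOTH geometry and pose, then descend jointly.
points = points_true + 0.1 * rng.normal(size=points_true.shape)
t = t_true + np.array([0.2, -0.1, 0.0])

lr = 0.01  # small step: the shared translation update sums over all points
loss0 = loss(points, t)
for _ in range(500):
    resid = project(points, t) - obs      # (N, 2) reprojection residuals
    grad_pts = np.zeros_like(points)
    grad_pts[:, :2] = 2.0 * resid         # dL/d(points), analytic
    grad_t = np.zeros(3)
    grad_t[:2] = 2.0 * resid.sum(axis=0)  # dL/d(t), analytic
    points -= lr * grad_pts
    t -= lr * grad_t

final_loss = loss(points, t)
```

The real setting replaces the points with 3D Gaussians, the orthographic model with a full camera, and the residual with a differentiable rendering loss, but the structure is the same: one descent loop updates scene parameters and camera parameters together. Note the toy has a gauge ambiguity (shifting all points trades off against shifting the camera), which is why the reprojection loss can reach zero without either estimate matching ground truth exactly.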