🤖 AI Summary
This work addresses the challenges of geometric inconsistency and lack of photorealism in multi-view human pose image generation from a single input image. We propose a dual-conditional diffusion framework that jointly leverages 3D-aware neural rendering and parametric human priors: Human NeRF provides geometrically consistent coarse multi-view renderings; the SMPL model extracts texture, normal, and semantic features, which are fused hierarchically to jointly optimize global structure and local details; finally, a diffusion model performs high-fidelity image refinement. Our method significantly outperforms existing approaches under complex poses, loose clothing, and occlusion scenarios. It achieves joint novel-view and novel-pose synthesis with precise geometry, sharp details, and strong cross-view consistency. The framework establishes a robust and efficient paradigm for single-image human editing, enabling controllable, high-quality, and geometrically faithful human image generation.
📝 Abstract
The creation of lifelike human avatars capable of realistic pose variation and viewpoint flexibility remains a fundamental challenge in computer vision and graphics. Current approaches typically either yield geometrically inconsistent multi-view images or sacrifice photorealism, resulting in blurry outputs under diverse viewing angles and complex motions. To address these issues, we propose Blur2Sharp, a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images from only a single reference view. Our method employs a dual-conditioning architecture: initially, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. Subsequently, a diffusion model conditioned on these renderings refines the generated images, preserving fine-grained details and structural fidelity. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy. Extensive experiments demonstrate that Blur2Sharp consistently surpasses state-of-the-art techniques in both novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.
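The two-stage conditioning flow described above (coarse geometric rendering → prior fusion → diffusion refinement) can be sketched as follows. This is a minimal illustrative sketch only: every function name, shape, and operation here is a hypothetical stand-in, since the actual Human NeRF, SMPL feature extractors, and diffusion refiner are learned networks not specified at this level of detail in the abstract.

```python
import numpy as np

H, W = 64, 64  # hypothetical render resolution

def human_nerf_render(pose, view):
    """Stage 1 stand-in: a coarse but geometrically consistent rendering.
    Here just a smooth gradient image parameterized by pose and view."""
    ys, xs = np.mgrid[0:H, 0:W] / H
    return np.stack([ys * pose, xs * view, ys * xs], axis=-1)

def smpl_priors(pose):
    """Stand-in for texture / normal / semantic feature maps from SMPL."""
    base = np.full((H, W, 3), pose)
    return {"texture": base, "normal": base * 0.5, "semantic": base * 0.25}

def hierarchical_fusion(coarse, priors, weights=(0.5, 0.3, 0.2)):
    """Fuse SMPL priors into the coarse render; a weighted sum stands in
    for the paper's hierarchical feature fusion."""
    fused = coarse.copy()
    for w, feat in zip(weights, priors.values()):
        fused += w * feat
    return fused

def diffusion_refine(fused, steps=4):
    """Stage 2 stand-in: iterative refinement toward a sharp image.
    The real model is a conditional diffusion denoiser; here, a simple
    clipped unsharp-masking loop plays that role."""
    img = np.clip(fused, 0.0, 1.0)
    for _ in range(steps):
        blur = (np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0)) / 2
        img = np.clip(img + 0.1 * (img - blur), 0.0, 1.0)
    return img

coarse = human_nerf_render(pose=0.7, view=0.3)        # 3D structural guidance
fused = hierarchical_fusion(coarse, smpl_priors(0.7)) # global + local priors
sharp = diffusion_refine(fused)                       # high-fidelity refinement
print(sharp.shape)  # (64, 64, 3)
```

The design point the sketch mirrors is the dual conditioning: the refinement stage never sees the input image alone, but always a geometry-aware rendering plus parametric human priors.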