IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos

📅 2025-04-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the challenge of generating high-fidelity, geometrically consistent, 3D-aware talking-head videos from a single input portrait image, without post-processing. To this end, it proposes an end-to-end method trained on monocular videos that, instead of reconstructing a separate 3D representation (e.g., NeRF or 3D Gaussians), directly generates Multi-Plane Images (MPIs) with a 3D-aware diffusion model. A training strategy that stochastically reconstructs the output in either the reference or the target camera space lets the model jointly learn fine-grained texture and implicit 3D structure without multi-view data or explicit 3D supervision, with geometric consistency further enforced through monocular video self-supervision. Experiments show competitive visual quality and novel-view rendering, with the final output produced by a single denoising process and no post-processing steps; the MPI representation also naturally supports stereo rendering for VR headsets.
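Since the method's output is an MPI, the basic rendering primitive is alpha compositing of fronto-parallel RGBA planes. Below is a minimal sketch of that compositing step; the array names and shapes are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of Multi-Plane Image (MPI) compositing, assuming an MPI of
# D fronto-parallel RGBA planes already aligned to the rendering camera.
# Array names and shapes are illustrative assumptions, not from the paper.
import numpy as np

def composite_mpi(rgb, alpha):
    """Composite D planes ordered near-to-far.

    rgb:   (D, H, W, 3) per-plane color.
    alpha: (D, H, W, 1) per-plane opacity in [0, 1].
    Front-to-back accumulation with transmittance, equivalent to
    back-to-front "over" compositing.
    """
    out = np.zeros(rgb.shape[1:])             # (H, W, 3) accumulated color
    transmittance = np.ones(alpha.shape[1:])  # (H, W, 1) light not yet absorbed
    for d in range(rgb.shape[0]):
        out += transmittance * alpha[d] * rgb[d]  # plane d's visible contribution
        transmittance *= 1.0 - alpha[d]           # occlusion by plane d
    return out
```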

๐Ÿ“ Abstract
We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
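The training mechanism described above can be made concrete with a short sketch: at each step, the generated MPI is rendered in either the reference or the target camera space, chosen at random, and supervised against the corresponding monocular frame. Everything here (model.denoise, render_mpi, the camera arguments) is a hypothetical placeholder under that reading, not the paper's actual interface.

```python
# A hedged sketch of the stochastic reconstruction training idea: supervise the
# generated MPI in a randomly chosen camera space. `model`, `render_mpi`,
# `control`, and the camera arguments are hypothetical placeholders.
import random

def training_step(model, render_mpi, ref_image, control,
                  ref_frame, tgt_frame, ref_cam, tgt_cam):
    mpi = model.denoise(ref_image, control)  # denoised RGBA planes
    if random.random() < 0.5:
        # Reference camera: planes render without warping, so the photometric
        # loss pushes the model toward sharp image detail.
        pred, gt = render_mpi(mpi, cam=ref_cam), ref_frame
    else:
        # Target camera: planes must be warped to a different viewpoint, so the
        # loss constrains the implicit 3D structure of the MPI.
        pred, gt = render_mpi(mpi, cam=tgt_cam), tgt_frame
    return ((pred - gt) ** 2).mean()  # simple L2 reconstruction loss
```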
Problem

Research questions and friction points this paper is trying to address.

Generating photorealistic talking heads from single images
Ensuring 3D geometric consistency without explicit reconstruction
Learning 3D-aware video diffusion from monocular videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-aware diffusion for photorealistic talking heads
Generates Multiplane Images ensuring geometric consistency (see the rendering sketch below)
Single denoising process eliminates post-processing steps
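To illustrate why MPIs make novel-view rendering cheap enough to skip post-processing, here is a sketch that warps each plane by its plane-induced homography and then composites. The conventions (shared intrinsics K; R, t taking reference-camera coordinates to the novel view; near-to-far plane order) follow the common MPI rendering formulation and are assumptions, not the paper's exact implementation.

```python
# A sketch of novel-view rendering from a generated MPI via per-plane
# homography warping plus compositing. Conventions are assumptions, not the
# paper's exact pipeline.
import numpy as np
import cv2

def render_novel_view(planes, depths, K, R, t):
    """planes: list of (H, W, 4) float RGBA planes in the reference camera.
    depths: plane depths d_i in the reference frame, ordered near-to-far."""
    n = np.array([[0.0, 0.0, 1.0]])  # fronto-parallel plane normal
    h, w = planes[0].shape[:2]
    out = np.zeros((h, w, 3))
    transmittance = np.ones((h, w, 1))
    for rgba, d in zip(planes, depths):
        # Homography induced by the plane z = d, mapping reference pixels to
        # novel-view pixels: H = K (R + t n^T / d) K^{-1}.
        H = K @ (R + t.reshape(3, 1) @ n / d) @ np.linalg.inv(K)
        warped = cv2.warpPerspective(rgba, H, (w, h))
        color, alpha = warped[..., :3], warped[..., 3:4]
        out += transmittance * alpha * color  # visible contribution
        transmittance *= 1.0 - alpha          # occlusion by this plane
    return out
```

The same per-plane warp evaluated at two horizontally offset viewpoints yields the binocular image pair for VR headsets mentioned in the abstract.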