🤖 AI Summary
To address critical challenges in single-image face editing—including identity collapse, misaligned hairlines, background distortion, and contextual inconsistency (e.g., hairstyle and accessories)—this paper proposes a parameter-free, 3DMM-guided diffusion framework leveraging multi-view geometric priors. Methodologically: (i) we introduce the first learnable-parameter-free 3D Morphable Model (3DMM) conditioning mechanism, enabling geometry-aware cross-view consistency; (ii) we design a joint embedding module integrating ArcFace-based identity feature distillation and CLIP-driven cross-modal semantic alignment to jointly constrain identity, hairstyle, accessories, and background; (iii) the framework supports fine-grained control over pose, expression, and illumination. Quantitatively, it achieves state-of-the-art performance with +12.6% identity preservation rate, −23.4 FID improvement, and superior edit controllability—demonstrating high-fidelity, contextually consistent editing from a single input image.
📝 Abstract
Facial appearance editing is crucial for digital avatars, AR/VR, and personalized content creation, driving realistic user experiences. However, preserving identity with generative models is challenging, especially in scenarios with limited data availability. Traditional methods often require multiple images and still struggle with unnatural face shifts, inconsistent hair alignment, or excessive smoothing effects. To overcome these challenges, we introduce a novel diffusion-based framework, InstaFace, to generate realistic images while preserving identity using only a single image. Central to InstaFace, we introduce an efficient guidance network that harnesses 3D perspectives by integrating multiple 3DMM-based conditionals without introducing additional trainable parameters. Moreover, to ensure maximum identity retention as well as preservation of background, hair, and other contextual features like accessories, we introduce a novel module that utilizes feature embeddings from a facial recognition model and a pre-trained vision-language model. Quantitative evaluations demonstrate that our method outperforms several state-of-the-art approaches in terms of identity preservation, photorealism, and effective control of pose, expression, and lighting.