🤖 AI Summary
Existing diffusion-based face-swapping methods suffer from inadequate identity preservation and visible artifacts under complex poses and expressions, largely because they lack explicit 3D facial structural modeling. To address this, we propose the first diffusion-based face-swapping framework to incorporate 3D facial latent features. Our method jointly conditions the denoising process on 3D morphable parameters, identity embeddings, and facial landmarks, enabling effective disentanglement of identity, pose, and expression. We also introduce a dual-modality evaluation protocol that combines biometric identification metrics with human perceptual studies. Extensive experiments on CelebA, FFHQ, and CelebV-Text show significant improvements over state-of-the-art methods, achieving both high-fidelity visual quality and strong identity consistency.
📝 Abstract
Diffusion-based approaches have recently achieved strong results in face swapping, offering improved visual quality over traditional GAN-based methods. However, even state-of-the-art models often suffer from fine-grained artifacts and poor identity preservation, particularly under challenging poses and expressions. A key limitation of existing approaches is their failure to meaningfully leverage 3D facial structure, which is crucial for disentangling identity from pose and expression. In this work, we propose DiffSwap++, a novel diffusion-based face-swapping pipeline that incorporates 3D facial latent features during training. By guiding the generation process with 3D-aware representations, our method enhances geometric consistency and improves the disentanglement of facial identity from appearance attributes. We further design a diffusion architecture that conditions the denoising process on both identity embeddings and facial landmarks, enabling high-fidelity and identity-preserving face swaps. Extensive experiments on CelebA, FFHQ, and CelebV-Text demonstrate that DiffSwap++ outperforms prior methods in preserving source identity while maintaining target pose and expression. Additionally, we introduce a biometric-style evaluation and conduct a user study to further validate the realism and effectiveness of our approach. Code will be made publicly available at https://github.com/WestonBond/DiffSwapPP.
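The conditioning scheme described above (a denoiser guided jointly by identity embeddings, facial landmarks, and 3D morphable-model parameters) can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's implementation: all module names, dimensions (e.g. a 512-d ArcFace-style identity embedding, 68 2D landmarks, 62 3DMM coefficients), and the FiLM-style fusion are assumptions chosen for clarity.

```python
# Hypothetical sketch of conditioning a diffusion denoiser on identity,
# landmark, and 3DMM signals. Dimensions and architecture are illustrative
# assumptions, not taken from DiffSwap++.
import torch
import torch.nn as nn


class ConditionedDenoiser(nn.Module):
    def __init__(self, img_ch=3, id_dim=512, lmk_dim=68 * 2,
                 p3d_dim=62, cond_dim=256):
        super().__init__()
        # Project each conditioning signal into a shared space, then fuse.
        self.id_proj = nn.Linear(id_dim, cond_dim)
        self.lmk_proj = nn.Linear(lmk_dim, cond_dim)
        self.p3d_proj = nn.Linear(p3d_dim, cond_dim)
        self.fuse = nn.Linear(3 * cond_dim, cond_dim)
        # Minimal stand-in for a U-Net backbone: one conv encoder/decoder
        # pair, modulated by the fused condition via FiLM-style scale/shift.
        self.enc = nn.Conv2d(img_ch, 64, 3, padding=1)
        self.film = nn.Linear(cond_dim, 2 * 64)
        self.dec = nn.Conv2d(64, img_ch, 3, padding=1)

    def forward(self, x_t, id_emb, landmarks, params3d):
        # Concatenate the three projected conditions and fuse them.
        c = torch.cat([
            self.id_proj(id_emb),
            self.lmk_proj(landmarks.flatten(1)),
            self.p3d_proj(params3d),
        ], dim=1)
        c = self.fuse(c)
        h = torch.relu(self.enc(x_t))
        scale, shift = self.film(c).chunk(2, dim=1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.dec(h)  # predicted noise, same shape as x_t


denoiser = ConditionedDenoiser()
x_t = torch.randn(2, 3, 64, 64)     # noisy face images at timestep t
id_emb = torch.randn(2, 512)        # source identity embeddings (assumed)
landmarks = torch.randn(2, 68, 2)   # target 2D facial landmarks (assumed)
params3d = torch.randn(2, 62)       # 3DMM pose/expression coeffs (assumed)
eps_pred = denoiser(x_t, id_emb, landmarks, params3d)
print(eps_pred.shape)  # torch.Size([2, 3, 64, 64])
```

Keeping identity, landmark, and 3DMM signals as separate projections before fusion is one simple way to let the network weight geometric (pose/expression) cues independently of identity cues, which is the disentanglement goal the abstract describes.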