🤖 AI Summary
Existing face-swapping methods often compromise target identity preservation and facial expression naturalness, and suffer from temporal inconsistency in videos. This paper proposes a high-fidelity video face-swapping framework. First, it constructs four disentangled facial conditions (identity, expression, pose, and geometry) from 3D facial priors to enable fine-grained, independent control. Second, it introduces a collaborative Face Former–ReferenceNet architecture that decouples high-level identity injection from low-level detail reconstruction. Third, it incorporates a plug-and-play temporal attention mechanism to ensure inter-frame consistency over long video sequences. By integrating diffusion models with 3D Morphable Model (3DMM) priors, the method supports end-to-end video generation. Evaluated on FF++, it achieves state-of-the-art performance: identity similarity improves by 12.6%, expression error decreases by 31.4%, and FID drops by 2.8, demonstrating significant gains in generation stability and fidelity.
📝 Abstract
Face swapping transfers the identity of a source face to a target face while retaining attributes of the target face such as expression, pose, hair, and background. Advanced face-swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and identity accuracy. We propose DynamicFace, a novel method that leverages the power of diffusion models and plug-and-play temporal layers for video face swapping. First, we introduce four fine-grained face conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and independent control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection, respectively. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Moreover, our method can be easily transferred to the video domain with temporal attention layers. Our code and results will be available on the project page: https://dynamic-face.github.io/
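To make the disentangled-conditioning idea concrete, here is a minimal sketch of how four independent condition maps might be assembled into one conditioning tensor for a diffusion backbone. All names, shapes, and the stacking scheme are illustrative assumptions, not the paper's actual API; the point is only that each factor stays in its own channel so it can be controlled separately.

```python
import numpy as np

def build_condition_tensor(identity, expression, pose, geometry):
    """Stack four disentangled (H, W) condition maps into a (4, H, W) tensor.

    Keeping each factor in its own channel (rather than blending them)
    mirrors the disentanglement goal: editing one map leaves the others
    untouched. Shapes and semantics here are hypothetical.
    """
    maps = [identity, expression, pose, geometry]
    if len({m.shape for m in maps}) != 1:
        raise ValueError("all condition maps must share the same spatial size")
    return np.stack(maps, axis=0)

# Toy 8x8 maps standing in for real renderings from a 3DMM fit (assumed):
H, W = 8, 8
cond = build_condition_tensor(
    np.ones((H, W)),                          # identity cue from the source face
    np.zeros((H, W)),                         # expression cue from the target face
    np.full((H, W), 0.5),                     # pose cue from the target face
    np.linspace(0, 1, H * W).reshape(H, W),   # geometry cue from the target face
)
print(cond.shape)  # (4, 8, 8)
```

In a real pipeline, each map would be a rendering derived from the fitted 3DMM (e.g. a shading or landmark image) rather than a constant array, and the stacked tensor would feed the conditional branch of the diffusion model.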