CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

📅 2026-04-27
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing face-swapping methods struggle to preserve identity and achieve visual realism at the same time under large pose and expression variations. This work proposes CA-IDD, a diffusion-based face-swapping framework that the authors describe as the first to integrate multi-modal guidance through multi-scale cross-attention. Precomputed identity embeddings are injected into the denoising process through hierarchical cross-attention layers, together with gaze direction and facial parsing maps, to enable spatially adaptive identity alignment and fine-grained regional control. The approach is designed to avoid the mode collapse and limited controllability of GAN-based methods. It reports an FID of 11.73 (lower is better), outperforming baselines such as FaceShifter and MegaFS, and qualitative results show improved identity retention across diverse head poses.
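
For context on the reported score, the standard definition of the Fréchet Inception Distance (FID) is given below. The symbols are introduced here for illustration and do not come from the paper: \mu_r, \Sigma_r are the mean and covariance of Inception features of real images, and \mu_g, \Sigma_g those of generated images. A lower value means the two feature distributions are closer.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```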

📝 Abstract
Face swapping aims to generate a realistic facial image by transferring the identity of a source face onto a target face while preserving the target's pose, expression, and context. However, existing methods, especially GAN-based ones, often struggle to balance identity preservation and visual realism because of limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision from facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73 (lower is better), outperforming established baselines such as FaceShifter and MegaFS. Qualitative results also show improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.
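
The cross-attention conditioning described in the abstract can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation. It shows generic single-head scaled dot-product cross-attention, in which spatial features of a denoiser act as queries and conditioning tokens act as keys and values. The function names, the toy 4-dimensional vectors, the omission of learned projections, and the residual update are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, cond_tokens):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries:     one feature vector per spatial location of the denoiser
    cond_tokens: conditioning vectors used as both keys and values
    Learned query/key/value projections are omitted (assumption).
    Returns (updated_features, attention_weights).
    """
    dim = len(cond_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this spatial location to every conditioning token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in cond_tokens]
        w = softmax(scores)
        # Weighted sum of the conditioning tokens (values).
        attended = [sum(wj * tok[d] for wj, tok in zip(w, cond_tokens))
                    for d in range(dim)]
        # Residual update of the spatial feature (assumption).
        outputs.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return outputs, weights


# Hypothetical toy conditioning tokens, one per guidance signal.
identity = [1.0, 0.0, 0.0, 0.0]
gaze = [0.0, 1.0, 0.0, 0.0]
parsing = [0.0, 0.0, 1.0, 0.0]
cond = [identity, gaze, parsing]

# Two hypothetical spatial locations: the first resembles the identity
# token, the second resembles the parsing token.
features = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
updated, attn = cross_attention(features, cond)

for location, w in enumerate(attn):
    print(location, [round(x, 3) for x in w])
```

Because each spatial location computes its own attention weights, different regions can draw on different guidance signals, which is one way to read the "spatially adaptive" alignment the abstract mentions. A multi-scale variant would, under the same assumptions, apply such a layer to feature maps at several resolutions of the denoising network.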
Problem

Research questions and friction points this paper is trying to address.

face swapping
identity preservation
visual realism
mode collapse
controllability
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model
cross-attention
identity-consistent face swapping
multi-modal guidance
face editing