CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing face-swapping methods struggle to preserve identity and achieve visual realism at the same time, especially under large pose and expression variations. This work proposes CA-IDD, a multimodal-guided face-swapping framework that the authors describe as the first diffusion-based approach to combine gaze, identity, and facial parsing guidance. Using precomputed identity embeddings and a hierarchical (multi-scale) cross-attention mechanism, the method injects identity features, gaze direction, and facial parsing maps into the denoising process, enabling spatially adaptive identity alignment and fine-grained regional control. The approach is designed to avoid the mode collapse and limited controllability of GAN-based methods. It reports an FID of 11.73 (lower is better), outperforming established baselines such as FaceShifter and MegaFS, and qualitative results show improved identity retention across diverse head poses.

📝 Abstract

Face swapping aims to generate a realistic facial image by transferring the identity of a source face onto a target face while preserving the target's pose, expression, and context. However, existing methods, especially GAN-based ones, often struggle to balance identity preservation and visual realism because of limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face-swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73 (lower is better), outperforming established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.
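
The abstract does not give implementation details, so the following is only a minimal sketch of how precomputed identity embeddings could be injected into denoiser features through cross-attention at more than one resolution. Everything named here (`cross_attention`, the dimensions `d`, `d_id`, `d_k`, the random stand-in tensors, and the choice to share projection weights across scales) is an assumption made for illustration, not the authors' architecture. Each spatial position of a feature map forms a query, the identity tokens supply keys and values, and the attended result is added back as a residual.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(features, id_tokens, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection.

    features:  (N, d)    flattened spatial features of the denoiser (queries)
    id_tokens: (M, d_id) precomputed identity embeddings (keys and values)
    Returns the updated (N, d) features and the (N, M) attention weights.
    """
    q = matmul(features, w_q)               # (N, d_k)
    k = matmul(id_tokens, w_k)              # (M, d_k)
    v = matmul(id_tokens, w_v)              # (M, d)
    k_t = [list(col) for col in zip(*k)]    # (d_k, M)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, k_t)                 # (N, M)
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)           # (N, d)
    out = [[f + a for f, a in zip(f_row, a_row)]
           for f_row, a_row in zip(features, attended)]
    return out, weights


def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or real activations."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, d_id, d_k = 8, 16, 4                     # hypothetical channel sizes
id_tokens = random_matrix(3, d_id, rng)     # 3 hypothetical identity tokens
w_q = random_matrix(d, d_k, rng)
w_k = random_matrix(d_id, d_k, rng)
w_v = random_matrix(d_id, d, rng)

# "Hierarchical" is approximated here by applying the same attention
# at two spatial resolutions (4x4 and 8x8 feature maps).
results = {}
for side in (4, 8):
    features = random_matrix(side * side, d, rng)
    results[side] = cross_attention(features, id_tokens, w_q, w_k, w_v)
```

Because every spatial position computes its own attention weights over the identity tokens, the amount of identity information mixed in can differ from region to region, which is one plausible reading of "spatially adaptive identity alignment". A real model would likely use separate learned projections per scale and multiple heads.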

Problem

Research questions and friction points this paper is trying to address.

face swapping

identity preservation

visual realism

mode collapse

controllability

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model

cross-attention

identity-consistent face swapping