🤖 AI Summary
Detecting instruction-guided deepfakes, especially fine-grained image edits driven by textual prompts, remains highly challenging. To address this, we propose the first multimodal capsule network designed specifically for this task. Our method, CapsFake, fuses visual, textual, and low-level frequency-domain features, and introduces two key mechanisms: (1) a cross-modal low-level capsule collaboration module for joint feature encoding, and (2) a competitive dynamic routing mechanism that predicts high-level capsules for context-aware, fine-grained localization of tampered regions. We make three technical contributions: (i) the first application of capsule networks to instruction-guided deepfake detection; (ii) the integration of adversarially robust training; and (iii) an explicit design for cross-dataset generalization. Experiments on four major editing datasets, including MagicBrush, show that our method outperforms state-of-the-art detectors by up to 20% in detection accuracy. It achieves detection rates above 94% under natural perturbations and above 96% under adversarial attacks, demonstrating strong generalization and robustness.
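To make the low-level capsule stage concrete, the sketch below builds capsule sets from visual, textual, and frequency-domain features and concatenates them into one joint set for routing. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: the feature dimensions (`vis_dim`, `txt_dim`, `freq_dim`), the capsule counts, and the standard `squash` nonlinearity are illustrative choices borrowed from conventional capsule networks.

```python
# Hypothetical sketch of multimodal low-level capsule encoding.
# Module names, dimensions, and capsule sizes are assumptions for
# illustration; the paper's actual encoders may differ.
import torch
import torch.nn as nn


class LowLevelCapsules(nn.Module):
    """Project per-modality features into a set of capsule pose vectors."""

    def __init__(self, in_dim: int, num_caps: int = 16, cap_dim: int = 8):
        super().__init__()
        self.proj = nn.Linear(in_dim, num_caps * cap_dim)
        self.num_caps, self.cap_dim = num_caps, cap_dim

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        caps = self.proj(feats).view(-1, self.num_caps, self.cap_dim)
        # Squash nonlinearity: capsule length encodes activation probability.
        norm = caps.norm(dim=-1, keepdim=True)
        return (norm**2 / (1 + norm**2)) * caps / (norm + 1e-8)


class MultimodalCapsFusion(nn.Module):
    """Concatenate low-level capsules from visual, text, and frequency branches."""

    def __init__(self, vis_dim: int = 768, txt_dim: int = 512, freq_dim: int = 256):
        super().__init__()
        self.vis = LowLevelCapsules(vis_dim)
        self.txt = LowLevelCapsules(txt_dim)
        self.freq = LowLevelCapsules(freq_dim)

    def forward(self, vis_feats, txt_feats, freq_feats):
        # Joint capsule set over which high-level routing then operates.
        return torch.cat(
            [self.vis(vis_feats), self.txt(txt_feats), self.freq(freq_feats)], dim=1
        )
```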
📝 Abstract
The rapid evolution of deepfake technology, particularly in instruction-guided image editing, threatens the integrity of digital images by enabling subtle, context-aware manipulations. Because these edits are generated conditionally from real images and textual prompts, they are often imperceptible to both humans and existing detection systems, revealing significant limitations in current defenses. We propose a novel multimodal capsule network, CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision. Evaluated on diverse datasets, including MagicBrush, Unsplash Edits, Open Images Edits, and Multi-turn Edits, CapsFake outperforms state-of-the-art methods by up to 20% in detection accuracy. Ablation studies validate its robustness: it achieves detection rates above 94% under natural perturbations and 96% against adversarial attacks, and generalizes well to unseen editing scenarios. This approach establishes a powerful framework for countering sophisticated image manipulations.
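As one plausible reading of the competitive routing step, the sketch below follows classic routing-by-agreement, where a softmax over high-level capsules makes them compete for each low-level capsule's vote. The function name `competitive_routing`, the `u_hat` vote tensor, and the three routing iterations are assumptions for illustration; the paper's exact competitive variant is not specified here.

```python
# Hedged sketch of dynamic routing from low-level to high-level capsules.
# Follows the standard routing-by-agreement scheme; details of the paper's
# competitive variant are assumed, not reproduced.
import torch
import torch.nn.functional as F


def squash(v: torch.Tensor, dim: int = -1) -> torch.Tensor:
    norm = v.norm(dim=dim, keepdim=True)
    return (norm**2 / (1 + norm**2)) * v / (norm + 1e-8)


def competitive_routing(u_hat: torch.Tensor, iters: int = 3) -> torch.Tensor:
    """u_hat: (batch, n_low, n_high, d) prediction vectors ("votes")
    from each low-level capsule to each high-level capsule."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    v = None
    for _ in range(iters):
        # Softmax over high-level capsules: they compete for each vote.
        c = F.softmax(b, dim=2).unsqueeze(-1)
        s = (c * u_hat).sum(dim=1)  # weighted sum of votes per high-level capsule
        v = squash(s)               # (batch, n_high, d) high-level poses
        # Strengthen logits where votes agree with the current high-level pose.
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v
```

In this scheme the length of each returned high-level capsule can be read as the confidence that the corresponding region (or class) is manipulated, which is consistent with the localization role the abstract describes.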