CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of fine-grained annotation in multimodal misinformation detection by proposing the CIEC framework, which localizes manipulated regions under a weakly supervised setting using only coarse image- or sentence-level labels. CIEC is the first approach to jointly exploit implicit and explicit cues from both the textual and visual modalities, introducing a Textual-guidance Refine Patch Selection (TRPS) module and a Visual-deviation Calibrated Token Grounding (VCTG) module. The framework further incorporates spatial contrastive learning, semantic consistency constraints, background suppression, and asymmetric sparsity regularization to enable cross-modal guidance and collaborative optimization. Experimental results demonstrate that CIEC significantly outperforms existing weakly supervised methods across multiple metrics and approaches the performance of fully supervised models.

📝 Abstract
To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. However, current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level labels. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which achieves multimodal weakly-supervised manipulation localization for image-text pairs using only coarse-grained image/sentence-level annotations. It comprises two branches: image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module, which integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors; background-silencing and spatial-contrast constraints then suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module, which focuses on meaningful content words and leverages relative visual bias to assist token localization; asymmetric-sparsity and semantic-consistency constraints then mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.
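The weakly-supervised setting described in the abstract — image-level labels supervising patch-level localization — is commonly realized with multiple-instance-learning-style pooling. The pure-Python sketch below illustrates only that general idea, not the CIEC architecture; the patch scores, max pooling, and threshold are illustrative assumptions:

```python
# Illustrative MIL-style weak supervision: a single image-level "manipulated"
# label can supervise per-patch scores, because the image-level prediction is
# a pooling of patch scores. Values here are toy numbers, not CIEC outputs.

def image_level_score(patch_scores):
    """Aggregate per-patch manipulation scores into one image-level score.
    Max pooling: the image counts as manipulated if any patch looks so."""
    return max(patch_scores)

def localize(patch_scores, threshold=0.5):
    """At inference, flag patches whose score exceeds the threshold."""
    return [i for i, s in enumerate(patch_scores) if s > threshold]

# Toy example: 6 patches, two of which score high.
scores = [0.05, 0.10, 0.85, 0.92, 0.08, 0.12]
print(image_level_score(scores))  # 0.92 -> image classified as manipulated
print(localize(scores))           # [2, 3] -> localized suspicious patches
```

Because the loss is computed on the pooled image-level score, only coarse labels are needed during training, while the per-patch scores provide localization for free at inference time.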
Problem

Research questions and friction points this paper is trying to address.

multimodal manipulation localization
weakly supervised learning
coarse-grained annotation
image-text pairs
forgery detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

weakly-supervised localization
multimodal manipulation detection
implicit-explicit cue coupling
text-guided patch selection
visual-deviation calibrated token grounding
Xinquan Yu
School of Computer Science and Engineering, MoE Key Laboratory of Information Technology, Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou 510006, China
Wei Lu
Sun Yat-sen University
computer science
Xiangyang Luo
Zhengzhou Information Science and Technology Institute
information hiding; data hiding; steganography
Rui Yang
Alibaba Group, Hangzhou, China