CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of fine-grained annotation in multimodal misinformation detection by proposing the CIEC framework, which localizes manipulated regions under a weakly supervised setting using only coarse image- or sentence-level labels. CIEC is the first approach to jointly exploit implicit and explicit cues from both the textual and visual modalities, introducing a Textual-guidance Refine Patch Selection (TRPS) module and a Visual-deviation Calibrated Token Grounding (VCTG) module. The framework further incorporates spatial contrastive learning, semantic consistency constraints, background suppression, and asymmetric sparsity regularization to enable cross-modal guidance and collaborative optimization. Experimental results demonstrate that CIEC significantly outperforms existing weakly supervised methods across multiple metrics and approaches the performance of fully supervised models.

📝 Abstract
To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. However, current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level labels. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which achieves multimodal weakly-supervised manipulation localization for image-text pairs using only coarse-grained image/sentence-level annotations. It comprises two branches: image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module, which integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors; background-silencing and spatial-contrast constraints then suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module, which focuses on meaningful content words and leverages relative visual bias to assist token localization; asymmetric-sparsity and semantic-consistency constraints then mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.
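The weakly-supervised setting described in the abstract — image-level labels supervising patch-level localization — is commonly realized with multiple-instance-learning-style pooling. The pure-Python sketch below illustrates only that general idea, not the CIEC architecture; the patch scores, max pooling, and threshold are illustrative assumptions:

```python
# Illustrative MIL-style weak supervision: a single image-level "manipulated"
# label can supervise per-patch scores, because the image-level prediction is
# a pooling of patch scores. Values here are toy numbers, not CIEC outputs.

def image_level_score(patch_scores):
    """Aggregate per-patch manipulation scores into one image-level score.
    Max pooling: the image counts as manipulated if any patch looks so."""
    return max(patch_scores)

def localize(patch_scores, threshold=0.5):
    """At inference, flag patches whose score exceeds the threshold."""
    return [i for i, s in enumerate(patch_scores) if s > threshold]

# Toy example: 6 patches, two of which score high.
scores = [0.05, 0.10, 0.85, 0.92, 0.08, 0.12]
print(image_level_score(scores))  # 0.92 -> image classified as manipulated
print(localize(scores))           # [2, 3] -> localized suspicious patches
```

Because the loss is computed on the pooled image-level score, only coarse labels are needed during training, while the per-patch scores provide localization for free at inference time.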
Problem

Research questions and friction points this paper is trying to address.

multimodal manipulation localization
weakly supervised learning
coarse-grained annotation
image-text pairs
forgery detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

weakly-supervised localization
multimodal manipulation detection
implicit-explicit cue coupling
text-guided patch selection
visual-deviation calibrated token grounding
Xinquan Yu
School of Computer Science and Engineering, MoE Key Laboratory of Information Technology, Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-sen University, Guangzhou 510006, China
Wei Lu
Sun Yat-sen University
computer science
Xiangyang Luo
Zhengzhou Information Science and Technology Institute
information hiding; data hiding; steganography
Rui Yang
Alibaba Group, Hangzhou, China