Malicious Image Analysis via Vision-Language Segmentation Fusion: Detection, Element, and Location in One-shot

πŸ“… 2025-12-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the dual challenge of holistic discrimination and fine-grained localization in illicit image moderation, this paper proposes a zero-shot malicious content detection framework integrating the Segment Anything Model (SAM) with a vision-language model (VLM). Methodologically, SAM generates candidate regions, which the VLM scores for harmfulness using open-vocabulary reasoning; these scores are then weighted and fused to produce pixel-level malicious heatmaps, while multi-segmenter ensembling enhances robustness against adaptive attacks. Key contributions include: (1) the first zero-shot framework achieving simultaneous harmful category identification, critical object localization, and pixel-level segmentation in a single forward pass; (2) state-of-the-art performance on a benchmark of 790 annotated images: 85.8% element recall, 78.1% precision, and 92.1% segmentation success rate; (3) strong adversarial robustness (≤10% performance drop under PGD attacks) and practical efficiency (on the order of seconds per image), coupled with high interpretability and deployability.
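The score-weighted fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `fuse_masks`, the normalization rule, and the toy masks are my own, not the paper's code; the paper's exact weighting scheme may differ.

```python
import numpy as np

def fuse_masks(masks, scores):
    """Fuse binary region masks into a pixel-level malicious heatmap.

    Each candidate region (e.g. a SAM mask) is weighted by its VLM
    harmfulness score; overlapping regions accumulate, and the result
    is normalized to [0, 1].
    """
    masks = np.asarray(masks, dtype=float)       # (n_regions, H, W)
    scores = np.asarray(scores, dtype=float)     # (n_regions,)
    heatmap = np.tensordot(scores, masks, axes=1)  # weighted sum over regions
    peak = heatmap.max()
    return heatmap / peak if peak > 0 else heatmap

# Toy example: two overlapping 4x4 regions with different VLM scores.
m1 = np.zeros((4, 4)); m1[:2, :2] = 1      # top-left region, score 0.9
m2 = np.zeros((4, 4)); m2[1:3, 1:3] = 1    # centre region, score 0.4
heat = fuse_masks([m1, m2], scores=[0.9, 0.4])
print(heat[1, 1])  # the overlap pixel carries the peak weight -> 1.0
```

Pixels covered by several high-scoring regions dominate the fused map, which is what lets the pipeline surface the most incriminating objects rather than whole-image flags.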

πŸ“ Abstract
Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects whether an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks, all in one pass. The system first applies a foundation segmentation model (SAM) to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly annotated 790-image dataset spanning drug, sexual, violent, and extremist content, our method attains 85.8% element-level recall, 78.1% precision, and a 92.1% segment-success rate, exceeding direct zero-shot VLM localization by 27.4% recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and the VLM, our method's precision and recall decreased by no more than 10%, demonstrating high robustness against attacks. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
Problem

Research questions and friction points this paper is trying to address.

Detects harmful content in images with one-shot analysis
Identifies and localizes malicious elements using pixel-accurate masks
Ensures robustness against adversarial attacks in content moderation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot pipeline for detection, identification, localization
Fusion of segmentation and vision-language models for scoring
Ensemble of segmenters for robustness against attacks
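The ensemble idea in the last bullet can be sketched as a per-pixel combination of heatmaps produced by independent segmenters. This is a hypothetical sketch: `ensemble_heatmaps`, the averaging rule, and the 0.5 threshold are illustrative assumptions; the paper does not specify its exact combination rule here.

```python
import numpy as np

def ensemble_heatmaps(heatmaps, threshold=0.5):
    """Combine malicious heatmaps from several independent segmenters.

    Averaging per pixel means an adaptive attack that blinds one
    segmenter only attenuates, rather than erases, the fused signal.
    """
    stack = np.stack([np.asarray(h, dtype=float) for h in heatmaps])
    fused = stack.mean(axis=0)            # per-pixel average across segmenters
    return fused, fused >= threshold      # continuous map + binary mask

# Segmenter A is fooled (all zeros); B and C still flag the centre pixel.
a = np.zeros((3, 3))
b = np.zeros((3, 3)); b[1, 1] = 1.0
c = np.zeros((3, 3)); c[1, 1] = 0.8
fused, mask = ensemble_heatmaps([a, b, c])
print(fused[1, 1], mask[1, 1])  # the signal survives one failed segmenter
```

Because each segmenter contributes an independent vote, an attack must simultaneously defeat every model in the ensemble to suppress a detection, which is the robustness property the PGD evaluation measures.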
πŸ”Ž Similar Papers
No similar papers found.
Sheng Hang
Shanghai Jiao Tong University, Shanghai, China
Chaoxiang He
Shanghai Jiao Tong University, Shanghai, China
Hongsheng Hu
Lecturer, School of Information and Physical Sciences, University of Newcastle
Trustworthy Machine Learning; Machine Unlearning
Hanqing Hu
Shanghai Jiao Tong University, Shanghai, China
Bin Benjamin Zhu
Microsoft Corporation, China
Shi-Feng Sun
Shanghai Jiao Tong University, Shanghai, China
Dawu Gu
Shanghai Jiao Tong University, Shanghai, China
Shuo Wang
Shanghai Jiao Tong University, Shanghai, China