AI Summary
To address the dual challenge of holistic discrimination and fine-grained localization in illicit image moderation, this paper proposes a zero-shot malicious content detection framework integrating the Segment Anything Model (SAM) with a vision-language model (VLM). Methodologically, SAM generates candidate regions, which the VLM scores for harmfulness using open-vocabulary reasoning; these scores are then weighted and fused to produce pixel-level malicious heatmaps, while multi-segmenter ensembling enhances robustness against adaptive attacks. Key contributions include: (1) the first zero-shot framework achieving simultaneous harmful category identification, critical object localization, and pixel-level segmentation in a single forward pass; (2) state-of-the-art performance on a benchmark of 790 annotated images: 85.8% element recall, 78.1% precision, and a 92.1% segmentation success rate; (3) strong adversarial robustness (≤10% performance drop under PGD attacks) and practical efficiency (seconds per image), coupled with high interpretability and deployability.
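The score-weighted fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `fuse_region_scores` and the per-pixel maximum as the fusion operator are assumptions, since the exact weighting rule is not specified in the summary.

```python
import numpy as np

def fuse_region_scores(masks, scores):
    """Fuse per-region VLM harmfulness scores into a pixel-level heatmap.

    masks  : list of boolean arrays of shape (H, W), one per candidate region
    scores : list of floats in [0, 1], one harmfulness score per region
    Overlapping regions are combined by taking the maximum score per pixel
    (an illustrative choice; other weighted-fusion rules are possible).
    """
    heatmap = np.zeros(masks[0].shape, dtype=float)
    for mask, score in zip(masks, scores):
        heatmap = np.maximum(heatmap, score * mask.astype(float))
    return heatmap

# Toy example: two overlapping 4x4 candidate regions
m1 = np.zeros((4, 4), dtype=bool); m1[:2, :] = True   # top half, score 0.9
m2 = np.zeros((4, 4), dtype=bool); m2[:, :2] = True   # left half, score 0.4
heat = fuse_region_scores([m1, m2], [0.9, 0.4])
print(heat)
```

Pixels covered only by the low-scoring region keep its score (0.4), while any pixel touched by the high-scoring region is flagged at 0.9, yielding the consolidated malicious object map.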
Abstract
Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects whether an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks, all in one pass. The system first applies a foundation segmentation model, the Segment Anything Model (SAM), to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly annotated 790-image dataset spanning drug, sexual, violent, and extremist content, our method attains 85.8% element-level recall, 78.1% precision, and a 92.1% segmentation success rate, exceeding direct zero-shot VLM localization by 27.4% in recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and the VLM, our method's precision and recall each decreased by no more than 10%, demonstrating strong robustness. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
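The multi-segmenter ensemble can be sketched as below. The averaging-then-thresholding rule and the function name `ensemble_heatmaps` are illustrative assumptions; the abstract states only that an ensemble across segmenters hardens the pipeline, not how their outputs are combined.

```python
import numpy as np

def ensemble_heatmaps(heatmaps, threshold=0.5):
    """Combine per-segmenter malicious heatmaps into one decision.

    heatmaps  : list of float arrays of shape (H, W), one per segmenter
    threshold : score above which a pixel is flagged malicious
    Averaging means an adaptive attack must degrade every segmenter's
    heatmap at once, not just a single segmentation method.
    """
    fused = np.mean(np.stack(heatmaps), axis=0)
    return fused, fused >= threshold

# Toy example: heatmaps from two hypothetical segmenters
h_a = np.array([[0.9, 0.1], [0.0, 0.8]])
h_b = np.array([[0.7, 0.3], [0.2, 0.6]])
fused, mask = ensemble_heatmaps([h_a, h_b])
print(fused)
print(mask)
```

A pixel stays flagged only if the segmenters agree on average, so perturbations that fool one mask generator are diluted by the others.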