AI Summary
To address the dual challenge of holistic discrimination and fine-grained localization in illicit image moderation, this paper proposes a zero-shot malicious content detection framework integrating the Segment Anything Model (SAM) with a vision-language model (VLM). Methodologically, SAM generates candidate regions, which the VLM scores for harmfulness using open-vocabulary reasoning; these scores are then weighted and fused to produce pixel-level malicious heatmaps, while multi-segmenter ensembling enhances robustness against adaptive attacks. Key contributions include: (1) the first zero-shot framework achieving simultaneous harmful category identification, critical object localization, and pixel-level segmentation in a single forward pass; (2) state-of-the-art performance on a benchmark of 790 annotated images: 85.8% element recall, 78.1% precision, and a 92.1% segmentation success rate; (3) strong adversarial robustness (≤10% performance drop under PGD attacks) and practical efficiency (seconds per image), coupled with high interpretability and deployability.
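The score-weighted fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `fuse_region_scores` and the per-pixel maximum as the fusion operator are assumptions, since the exact weighting rule is not specified in the summary.

```python
import numpy as np

def fuse_region_scores(masks, scores):
    """Fuse per-region VLM harmfulness scores into a pixel-level heatmap.

    masks  : list of boolean arrays of shape (H, W), one per candidate region
    scores : list of floats in [0, 1], one harmfulness score per region
    Overlapping regions are combined by taking the maximum score per pixel
    (an illustrative choice; other weighted-fusion rules are possible).
    """
    heatmap = np.zeros(masks[0].shape, dtype=float)
    for mask, score in zip(masks, scores):
        heatmap = np.maximum(heatmap, score * mask.astype(float))
    return heatmap

# Toy example: two overlapping 4x4 candidate regions
m1 = np.zeros((4, 4), dtype=bool); m1[:2, :] = True   # top half, score 0.9
m2 = np.zeros((4, 4), dtype=bool); m2[:, :2] = True   # left half, score 0.4
heat = fuse_region_scores([m1, m2], [0.9, 0.4])
print(heat)
```

Pixels covered only by the low-scoring region keep its score (0.4), while any pixel touched by the high-scoring region is flagged at 0.9, yielding the consolidated malicious object map.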
Abstract
Detecting illicit visual content demands more than image-level NSFW flags; moderators must also know what objects make an image illegal and where those objects occur. We introduce a zero-shot pipeline that simultaneously (i) detects whether an image contains harmful content, (ii) identifies each critical element involved, and (iii) localizes those elements with pixel-accurate masks, all in one pass. The system first applies a foundation segmentation model, the Segment Anything Model (SAM), to generate candidate object masks and refines them into larger independent regions. Each region is scored for malicious relevance by a vision-language model using open-vocabulary prompts; these scores weight a fusion step that produces a consolidated malicious object map. An ensemble across multiple segmenters hardens the pipeline against adaptive attacks that target any single segmentation method. Evaluated on a newly annotated 790-image dataset spanning drug, sexual, violent, and extremist content, our method attains 85.8% element-level recall, 78.1% precision, and a 92.1% segmentation success rate, exceeding direct zero-shot VLM localization by 27.4% in recall at comparable precision. Against PGD adversarial perturbations crafted to break SAM and the VLM, our method's precision and recall each decreased by no more than 10%, demonstrating strong robustness. The full pipeline processes an image in seconds, plugs seamlessly into existing VLM workflows, and constitutes the first practical tool for fine-grained, explainable malicious-image moderation.
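The multi-segmenter ensemble can be sketched as below. The averaging-then-thresholding rule and the function name `ensemble_heatmaps` are illustrative assumptions; the abstract states only that an ensemble across segmenters hardens the pipeline, not how their outputs are combined.

```python
import numpy as np

def ensemble_heatmaps(heatmaps, threshold=0.5):
    """Combine per-segmenter malicious heatmaps into one decision.

    heatmaps  : list of float arrays of shape (H, W), one per segmenter
    threshold : score above which a pixel is flagged malicious
    Averaging means an adaptive attack must degrade every segmenter's
    heatmap at once, not just a single segmentation method.
    """
    fused = np.mean(np.stack(heatmaps), axis=0)
    return fused, fused >= threshold

# Toy example: heatmaps from two hypothetical segmenters
h_a = np.array([[0.9, 0.1], [0.0, 0.8]])
h_b = np.array([[0.7, 0.3], [0.2, 0.6]])
fused, mask = ensemble_heatmaps([h_a, h_b])
print(fused)
print(mask)
```

A pixel stays flagged only if the segmenters agree on average, so perturbations that fool one mask generator are diluted by the others.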