RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work reveals a significant yet underexplored vulnerability in existing image safety classifiers when confronted with user-style malicious edits. To address this, the study introduces agent-based red-teaming to the image safety domain, formulating evasion as a combinatorial search over sequences of editing operations. It leverages vision-language models (VLMs) to generate semantically faithful candidate edits and employs Monte Carlo Tree Search (MCTS) to efficiently plan human-like attack trajectories. Experimental results on UnsafeBench demonstrate that, on average, fewer than two edits suffice to bypass detection for 76.2% of unsafe images while preserving 93.0% of their malicious semantics, thereby exposing systemic weaknesses in current content moderation systems.

📝 Abstract

Image safety classifiers serve as a critical component of contemporary content moderation systems on the internet. However, their resilience against user-style malicious image editing remains underexplored. Such behaviors are highly prevalent in daily scenarios but difficult to fully reproduce. To explore this vulnerability, we introduce RedEdit, a novel black-box red-teaming agent that formulates photo-editing evasion as a combinatorial search problem over edit-tool sequences. It adopts a Vision-Language-Model (VLM)-based proposer to generate semantically targeted candidate edits and a Monte Carlo Tree Search (MCTS) planner to prioritize promising edit paths while backtracking from ineffective ones. Together, the proposer and planner instantiate two key capabilities of human attackers, namely domain knowledge and iterative backtracking, to reproduce this practical threat. Our extensive experiments on UnsafeBench reveal profound systemic vulnerabilities: fewer than two edits on average enable 76.2% of unsafe images to evade detectors while retaining 93.0% of their malicious semantics, meaning that such manipulated content remains perceptually malicious to humans while easily bypassing automated moderation. We therefore appeal to the community for more attention to this overlooked practical threat.
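The proposer-plus-planner loop described in the abstract can be illustrated with a small, self-contained MCTS sketch. It is not the authors' implementation. It runs on an abstract toy state (a detector score and a semantic-retention score, both in points) rather than on images. The edit names, their numeric effects, the threshold, the per-edit penalty in the reward, and every function name are assumptions made up for illustration. `propose_edits` stands in for the VLM proposer and simply returns every tool.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical edit tools: (drop in detector score, drop in semantic score).
# All numbers are invented for this toy example, not taken from the paper.
EDITS = {
    "crop": (10, 2),
    "blur": (25, 10),
    "color_shift": (15, 1),
    "overlay_text": (30, 3),
}
THRESHOLD = 50  # toy detector flags the state when its score is >= THRESHOLD
MAX_DEPTH = 3   # longest edit sequence the planner considers


def apply_edit(state, edit):
    det_drop, sem_drop = EDITS[edit]
    return (max(0, state[0] - det_drop), max(0, state[1] - sem_drop))


def propose_edits(state):
    """Stand-in for the VLM proposer: returns every tool, ignoring the state."""
    return list(EDITS)


def evades(state):
    return state[0] < THRESHOLD


def reward(state, depth):
    """Assumed reward: semantic retention minus a small penalty per edit."""
    return state[1] / 100 - 0.02 * depth if evades(state) else 0.0


@dataclass
class Node:
    state: tuple
    depth: int = 0
    parent: Optional["Node"] = None
    edit: Optional[str] = None
    untried: list = field(default_factory=list)
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def ucb(child, parent_visits, c=1.4):
    exploit = child.value / child.visits
    return exploit + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(root_state, iterations=300, seed=0):
    rng = random.Random(seed)
    root = Node(root_state, untried=propose_edits(root_state))
    best_node, best_reward = None, 0.0
    for _ in range(iterations):
        # Selection: descend through fully expanded nodes by UCB.
        node = root
        while not node.untried and node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb(ch, parent.visits))
        # Expansion: try one proposed edit that has not been explored yet.
        if node.untried:
            edit = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(apply_edit(node.state, edit), node.depth + 1, node, edit)
            if not evades(child.state) and child.depth < MAX_DEPTH:
                child.untried = propose_edits(child.state)
            node.children.append(child)
            node = child
        # Evaluation: one query to the toy detector, no random rollout.
        r = reward(node.state, node.depth)
        if r > best_reward:
            best_node, best_reward = node, r
        # Backpropagation: weak branches lose priority, which is how the
        # planner "backtracks" from ineffective edit paths.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Recover the best edit sequence by walking parent links.
    path, node = [], best_node
    while node is not None and node.edit is not None:
        path.append(node.edit)
        node = node.parent
    return path[::-1], (best_node.state if best_node else root_state)


if __name__ == "__main__":
    print(search((90, 100)))
```

Calling `search((90, 100))` starts from a toy state with detector score 90 and full semantic retention, and returns the best edit sequence found together with the resulting state. The per-edit penalty is one simple way to bias the search toward short sequences. The paper may use a different objective.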

Problem

Research questions and friction points this paper is trying to address.

image safety classifiers

red-teaming

malicious image editing

content moderation

adversarial evasion

Innovation

Methods, ideas, or system contributions that make the work stand out.

red-teaming

image safety classifier

Monte Carlo Tree Search