RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

📅 2026-06-04

🤖 AI Summary
This work reveals a significant yet underexplored vulnerability in existing image safety classifiers when confronted with user-style malicious edits. To address this, the study introduces agent-based red-teaming to the image safety domain, formulating evasion as a combinatorial search over sequences of editing operations. It leverages vision-language models (VLMs) to generate semantically faithful candidate edits and employs Monte Carlo Tree Search (MCTS) to efficiently plan human-like attack trajectories. Experimental results on UnsafeBench demonstrate that, on average, fewer than two edits suffice to bypass detection for 76.2% of unsafe images while preserving 93.0% of their malicious semantics, thereby exposing systemic weaknesses in current content moderation systems.
📝 Abstract
Image safety classifiers are a critical component of contemporary content moderation systems on the internet. However, their resilience against user-style malicious image editing remains underexplored. Such editing is highly prevalent in everyday use but difficult to reproduce fully. To explore this vulnerability, we introduce RedEdit, a novel black-box red-teaming agent that formulates photo-editing evasion as a combinatorial search problem over edit-tool sequences. It adopts a Vision-Language-Model (VLM)-based proposer to generate semantically targeted candidate edits and a Monte Carlo Tree Search (MCTS) planner to prioritize promising edit paths while backtracking from ineffective ones. The proposer and the planner instantiate two key capabilities of human attackers, domain knowledge and iterative backtracking, respectively, to reproduce this practical threat. Our extensive experiments on UnsafeBench reveal profound systemic vulnerabilities: fewer than two edits on average enable 76.2% of unsafe images to evade detectors while retaining 93.0% of their malicious semantics. In other words, the manipulated content remains perceptually malicious to humans yet easily bypasses automated moderation. We therefore call on the community to pay more attention to this overlooked practical threat.
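The search formulation in the abstract can be illustrated with a generic MCTS loop over edit-tool sequences. The following is only a toy sketch, not the authors' implementation: the tool names, the fixed per-edit effects, the threshold, the length penalty, and the functions `propose_edits` and `evades` are all hypothetical stand-ins. In particular, `propose_edits` takes the place of the VLM proposer and `evades` takes the place of a black-box classifier query; no images or real models are involved.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical toy setup: each edit lowers an integer "detector score" (0..100).
TOY_EFFECT = {"crop": 15, "blur": 35, "recolor": 10, "overlay_text": 25}
MAX_EDITS = 3      # assumed cap on edit-sequence length
THRESHOLD = 50     # assumed decision threshold of the toy detector

def propose_edits(path):
    """Stand-in for the VLM proposer: candidate next edits not yet used."""
    return [t for t in TOY_EFFECT if t not in path]

def evades(path):
    """Stand-in for a black-box detector query on the edited image."""
    return 100 - sum(TOY_EFFECT[t] for t in path) < THRESHOLD

def is_terminal(path):
    return evades(path) or len(path) >= MAX_EDITS

def reward(path):
    # 1 for evasion, minus a small per-edit penalty to prefer short sequences.
    return (1.0 if evades(path) else 0.0) - 0.1 * len(path)

@dataclass
class Node:
    path: tuple
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    untried: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def uct_child(node, c=1.4):
    """Pick the child maximizing mean reward plus an exploration bonus."""
    return max(
        node.children,
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )

def mcts(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(path=(), untried=propose_edits(()))
    best_path, best_reward = (), reward(())
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = uct_child(node)
        # Expansion: add one untried edit as a new child.
        if node.untried:
            path = node.path + (node.untried.pop(),)
            child = Node(
                path=path,
                parent=node,
                untried=[] if is_terminal(path) else propose_edits(path),
            )
            node.children.append(child)
            node = child
        # Simulation: random rollout until a terminal edit sequence.
        path = node.path
        while not is_terminal(path):
            path = path + (rng.choice(propose_edits(path)),)
        r = reward(path)
        if r > best_reward:
            best_path, best_reward = path, r
        # Backpropagation: low-reward branches lose priority (backtracking).
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return best_path, best_reward

if __name__ == "__main__":
    print(mcts())
```

In this toy setting no single edit crosses the threshold, so the search has to combine two edits; the per-edit penalty is one simple way to bias the planner toward short sequences, and is an assumption of this sketch rather than something stated in the abstract.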
Problem

Research questions and friction points this paper is trying to address.

image safety classifiers
red-teaming
malicious image editing
content moderation
adversarial evasion
Innovation

Methods, ideas, or system contributions that make the work stand out.

red-teaming
image safety classifier
Monte Carlo Tree Search
Vision-Language Model
adversarial photo-editing