Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) are vulnerable to multimodal jailbreak attacks, while conventional static moderation lacks contextual awareness and interpretability. To address these limitations, we propose Agentic Moderation, a dynamic, agent-based safety alignment framework built on collaborative multi-agent reasoning. The framework comprises four specialized agents: Shield (guarding), Responder (reacting), Evaluator (assessing), and Reflector (reflecting). Through autonomous inference, dynamic interaction, and iterative reflection, these agents jointly perform fine-grained, context-aware, and interpretable modular moderation. Model-agnostic and plug-and-play, the framework integrates seamlessly with existing VLMs. Extensive experiments across five benchmark datasets and four state-of-the-art VLMs show that our approach reduces attack success rates by 7-19%, increases refusal rates by 4-20%, and maintains stable non-following rates, significantly improving the safety-usability trade-off.
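The paper does not include an implementation here; as a rough illustration only, the Shield → Responder → Evaluator → Reflector loop described above might be sketched as follows. All class, function, and rule names are hypothetical stand-ins (the real agents would be LLM-backed, not keyword checks):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    safe: bool
    rationale: str

def shield(image_desc: str, prompt: str) -> Verdict:
    # Guarding: screen the multimodal input before any generation.
    unsafe_terms = {"weapon", "exploit"}  # placeholder for an LLM safety check
    hit = any(t in (image_desc + " " + prompt).lower() for t in unsafe_terms)
    return Verdict(safe=not hit, rationale="flagged" if hit else "clear")

def responder(prompt: str, guarded: Verdict) -> str:
    # Reacting: answer normally, or refuse using the Shield's rationale.
    if not guarded.safe:
        return f"Refusal (input {guarded.rationale})"
    return f"Answer to: {prompt}"

def evaluator(response: str) -> Verdict:
    # Assessing: judge the draft output, not just the input.
    ok = "how to build" not in response.lower()  # placeholder output check
    return Verdict(safe=ok, rationale="output screen")

def reflector(response: str, verdict: Verdict) -> str:
    # Reflecting: revise the response when the Evaluator objects.
    return response if verdict.safe else "Refusal (revised after reflection)"

def moderate(image_desc: str, prompt: str) -> str:
    guarded = shield(image_desc, prompt)
    draft = responder(prompt, guarded)
    verdict = evaluator(draft)
    return reflector(draft, verdict)

print(moderate("a photo of a lock", "How do locks work?"))
```

The point of the sketch is the control flow: moderation is a multi-step dialogue over both input and draft output, rather than a single binary classifier bolted onto either end.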

📝 Abstract
Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches, which are applied as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.
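The abstract quotes three rates: ASR, NF, and RR. The paper's exact evaluation protocol is not reproduced here, but a minimal tally could look like the sketch below, assuming each attack attempt has been labeled as "complied" (attack succeeded), "refused" (explicit refusal), or "off-topic" (response did not follow the instruction); the labels and the `safety_rates` helper are hypothetical:

```python
from collections import Counter

def safety_rates(labels):
    """Tally ASR, RR, and NF from per-attempt labels (hypothetical scheme):
    'complied'  -> attack succeeded        -> Attack Success Rate (ASR)
    'refused'   -> explicit refusal        -> Refusal Rate (RR)
    'off-topic' -> instruction not followed -> Non-Following Rate (NF)
    """
    n = len(labels)
    counts = Counter(labels)
    asr = counts["complied"] / n
    rr = counts["refused"] / n
    nf = counts["off-topic"] / n
    return asr, rr, nf

print(safety_rates(["refused", "complied", "refused", "off-topic"]))
# -> (0.25, 0.5, 0.25)
```

Under this reading, a good defence lowers ASR and raises RR on attack prompts while keeping NF stable, which is the balance the reported 7-19% and 4-20% figures describe.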
Problem

Research questions and friction points this paper is trying to address.

Defending multimodal AI systems against jailbreak attacks using specialized agents
Achieving context-aware and interpretable safety moderation through dynamic cooperation
Improving refusal rates while maintaining balanced safety performance in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework for vision-language model safety
Dynamic cooperative agents enable context-aware moderation
Modular scalable safety enforcement through agentic architecture