🤖 AI Summary
Visual compliance verification requires jointly reasoning over fine-grained visual understanding and complex, structured policy rules. Existing approaches rely either on costly human-annotated data or on domain-specific models with limited generalizability; meanwhile, multimodal large language models (MLLMs) struggle to precisely interpret visual details and enforce formal regulatory constraints. This paper proposes CompAgent, the first agent-based framework for visual compliance verification, introducing a dual-agent collaboration mechanism (a planning agent and a verification agent) augmented by dynamic tool selection. The framework invokes specialized vision tools (e.g., object detection, face analysis, NSFW classification, image captioning) together with MLLMs to enable stepwise, interpretable, multimodal reasoning over both policy semantics and visual evidence. Evaluated on the UnsafeBench dataset, the method achieves an F1 score of 76%, outperforming the prior state of the art by 10 percentage points and significantly surpassing end-to-end classifiers and direct-prompting baselines.
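To make the dual-agent flow concrete, here is a minimal, hypothetical Python sketch of the control loop the summary describes: a planning agent selects vision tools based on the policy text, and a verification agent aggregates their outputs into a compliance decision. The tool names, the keyword-based planner, and the threshold logic are illustrative assumptions, not the paper's actual implementation (which uses MLLM reasoning in both agents).

```python
# Illustrative sketch of a CompAgent-style dual-agent pipeline.
# All tool implementations are stubs; in the paper these are real
# vision models and the agents are MLLM-driven.

TOOLS = {
    "object_detector": lambda image: {"objects": ["person", "bottle"]},
    "face_analyzer":   lambda image: {"faces": 1, "estimated_age": 34},
    "nsfw_detector":   lambda image: {"nsfw_score": 0.02},
    "captioner":       lambda image: {"caption": "a person holding a bottle"},
}

def planning_agent(policy: str) -> list:
    """Select tools whose outputs are relevant to the policy text.
    (Keyword matching is a stand-in for MLLM-based planning.)"""
    keyword_map = {
        "weapon": "object_detector",
        "minor":  "face_analyzer",
        "nudity": "nsfw_detector",
    }
    selected = {tool for kw, tool in keyword_map.items() if kw in policy.lower()}
    selected.add("captioner")  # captions provide general context for any policy
    return sorted(selected)

def verification_agent(policy: str, tool_outputs: dict) -> dict:
    """Stand-in for MLLM reasoning: collect evidence of violations
    from tool outputs, then decide compliance."""
    evidence = []
    if tool_outputs.get("nsfw_detector", {}).get("nsfw_score", 0.0) > 0.5:
        evidence.append("high NSFW score")
    if tool_outputs.get("face_analyzer", {}).get("estimated_age", 99) < 18:
        evidence.append("possible minor depicted")
    return {"compliant": not evidence, "evidence": evidence}

def verify(image, policy: str) -> dict:
    """Full pipeline: plan tools, run them, then verify against the policy."""
    tool_names = planning_agent(policy)
    outputs = {name: TOOLS[name](image) for name in tool_names}
    return verification_agent(policy, outputs)

result = verify(image=None, policy="No nudity and no minors in alcohol ads.")
print(result)  # compliant, since the stub tool outputs raise no evidence
```

The key design point mirrored here is that the planner conditions tool selection on the policy, so new policies can be handled by re-planning rather than retraining a classifier.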
📝 Abstract
Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising, where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent multimodal large language models (MLLMs) offer broad real-world knowledge and policy understanding, on their own they struggle to reason over fine-grained visual details and to apply structured compliance rules. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools, such as object detectors, face analyzers, NSFW detectors, and captioning models, and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A verification agent then integrates the image, tool outputs, and policy context to perform multimodal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving up to a 76% F1 score and a 10% improvement over the state of the art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.