EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: In e-commerce settings, evasive content (superficially compliant yet substantively non-compliant) undermines the reliability of LLMs and VLMs for policy violation detection. Method: This paper introduces the first expert-annotated, Chinese, e-commerce-specific multimodal benchmark for evasive content detection. It proposes a dual-task evaluation paradigm, Single-Violation (isolated rule assessment) and All-in-One (integrated rule application), and systematically evaluates 26 state-of-the-art LLMs and VLMs on 2,833 text instances and 13,961 images. The benchmark relies on fine-grained policy modeling, multimodal evasive sample construction, and long-context reasoning evaluation. Contribution/Results: The study finds that rule clarity strongly affects how well model judgments align with human judgments. All-in-One evaluation reduces the accuracy gap between partial-match and exact-match metrics by over 40%, which suggests that integrating overlapping rules into a single instruction makes model decisions more consistent. All evaluated models show robustness deficits, pointing to a need for better multimodal compliance reasoning.
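
The difference between the two task settings can be illustrated with a minimal sketch. It assumes that a Single-Violation prompt carries one rule at a time, while an All-in-One prompt merges the clauses of all rules and drops duplicates. The rule names, clause texts, prompt wording, and function names below are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List

# Hypothetical policy rules; the clause texts are illustrative only.
RULES: Dict[str, List[str]] = {
    "body_shaping": ["No guaranteed-result claims", "No medical-effect claims"],
    "height_growth": ["No guaranteed-result claims", "No claims of increasing adult height"],
}


def single_violation_prompts(rules: Dict[str, List[str]], content: str) -> Dict[str, str]:
    """Build one short prompt per rule (isolated rule assessment)."""
    prompts = {}
    for name, clauses in rules.items():
        rule_text = "\n".join(f"- {c}" for c in clauses)
        prompts[name] = (
            f"Rule ({name}):\n{rule_text}\n\n"
            f"Content: {content}\n"
            "Does the content violate this rule?"
        )
    return prompts


def all_in_one_prompt(rules: Dict[str, List[str]], content: str) -> str:
    """Merge all rules into one long prompt, keeping each distinct clause once."""
    seen = set()
    merged: List[str] = []
    for clauses in rules.values():
        for clause in clauses:
            key = clause.strip().lower()
            if key not in seen:  # overlapping clauses are listed only once
                seen.add(key)
                merged.append(clause.strip())
    rule_text = "\n".join(f"- {c}" for c in merged)
    return (
        f"Rules:\n{rule_text}\n\n"
        f"Content: {content}\n"
        "List every rule the content violates."
    )
```

In this sketch the two hypothetical rules share one clause, so the merged prompt lists three clauses instead of four.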

📝 Abstract
E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision-Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content: inputs (text or images) that superficially comply with platform policies while covertly conveying prohibited claims. Unlike traditional adversarial attacks that induce overt failures, evasive content exploits ambiguity and context, making it far harder to detect. Existing robustness benchmarks provide little guidance for this demanding, real-world challenge. We introduce EVADE, the first expert-curated, Chinese, multimodal benchmark specifically designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories, including body shaping, height growth, and health supplements. Two complementary tasks assess distinct capabilities: Single-Violation, which probes fine-grained reasoning under short prompts, and All-in-One, which tests long-context reasoning by merging overlapping policy rules into unified instructions. Notably, the All-in-One setting significantly narrows the performance gap between partial and full-match accuracy, suggesting that clearer rule definitions improve alignment between human and model judgment. We benchmark 26 mainstream LLMs and VLMs and observe substantial performance gaps: even state-of-the-art models frequently misclassify evasive samples. By releasing EVADE and strong baselines, we provide the first rigorous standard for evaluating evasive-content detection, expose fundamental limitations in current multimodal reasoning, and lay the groundwork for safer and more transparent content moderation systems in e-commerce. The dataset is publicly available at https://huggingface.co/datasets/koenshen/EVADE-Bench.
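
The gap between partial and full-match accuracy mentioned in the abstract can be made concrete with a small sketch. The abstract does not define the two metrics, so the definitions here are assumptions: a prediction counts as a full match when its label set equals the gold set, and as a partial match when it equals the gold set or shares at least one label with it. The function name and the label values are hypothetical.

```python
from typing import List, Set, Tuple


def match_accuracies(golds: List[Set[str]], preds: List[Set[str]]) -> Tuple[float, float]:
    """Return (partial_accuracy, full_accuracy) under the assumed definitions."""
    if len(golds) != len(preds):
        raise ValueError("golds and preds must have the same length")
    if not golds:
        raise ValueError("at least one sample is required")

    partial = 0
    full = 0
    for gold, pred in zip(golds, preds):
        if gold == pred:
            # Exact agreement counts for both metrics (including "no violation").
            full += 1
            partial += 1
        elif gold & pred:
            # At least one shared violation label counts as a partial match only.
            partial += 1
    n = len(golds)
    return partial / n, full / n


# Hypothetical labels: the second prediction finds only one of two violations.
golds = [{"A"}, {"A", "B"}, {"C"}, set()]
preds = [{"A"}, {"A"}, {"D"}, set()]
partial_acc, full_acc = match_accuracies(golds, preds)
gap = partial_acc - full_acc
```

Under these assumed definitions, full-match accuracy can never exceed partial-match accuracy, so the gap is non-negative; a smaller gap means that when a model finds some of the violations it tends to find all of them.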
Problem

Research questions and friction points this paper is trying to address.

Detecting evasive content in e-commerce using multimodal models
Evaluating model robustness against ambiguous policy-violating inputs
Addressing performance gaps in multimodal reasoning for content moderation
Innovation

Methods, ideas, or system contributions that make the work stand out.

First expert-curated Chinese multimodal benchmark
Assesses models on evasive content detection
Includes 2,833 text and 13,961 image samples
Authors
Ancheng Xu
Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Zhihao Yang
University of Chinese Academy of Sciences, Tongji University

Jingpeng Li
Reader of Computer Science, University of Stirling, UK
Transport Scheduling, Metaheuristics, Multi-Objective Optimisation, Machine Learning, Search-Based Software Engineering

Guanghu Yuan
Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

Longze Chen
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Natural Language Processing

Liang Yan
Alibaba Group

Jiehui Zhou
Alibaba Group

Zhen Qin
Alibaba Group

Hengyun Chang
Alibaba Group

Hamid Alinejad-Rokny
ARC DECRA & UNSW Scientia Fellow, Head of BioMedical Machine Learning Lab
BioMedical Machine Learning, Machine Learning for Health, Medical Artificial Intelligence, LLMs

Bo Zheng
Alibaba Group

Min Yang
Bytedance
Vision Language Model, Computer Vision, Video Understanding