🤖 AI Summary
Detecting multimodal hate speech, particularly when it is conveyed implicitly through humor or irony in image-text interactions, remains challenging due to the subtlety of malicious semantics and the limitations of existing vision-language models, which suffer from coarse-grained supervision and vulnerability to implicit biases. To address this, we propose a structured prompt optimization framework coupled with a multi-agent counterfactual data augmentation pipeline. Our approach integrates fine-grained annotation, modality-isolated rewriting, and collaborative reasoning to sharpen model discrimination of nuanced hateful intent. Applied to InternVL2, our method achieves state-of-the-art F1 scores on both binary and hierarchical hate detection tasks. We additionally generate 2,479 neutral counterfactual samples, which mitigate spurious correlations while improving generalization and fairness. This work establishes an interpretable and robust paradigm for implicit multimodal hate speech detection.
📝 Abstract
The modern web is saturated with multimodal content, intensifying the challenge of detecting hateful memes, where harmful intent is often conveyed through subtle interactions between text and image under the guise of humor or satire. While recent advances in Vision-Language Models (VLMs) show promise, these models lack fine-grained supervision and remain susceptible to implicitly conveyed hate. In this paper, we present a dual-pronged approach to improve multimodal hate detection. First, we propose a prompt optimization framework that systematically varies prompt structure, supervision granularity, and training modality. We show that both prompt design and label scaling influence performance: structured prompts improve robustness even in small models, and InternVL2 achieves the best F1 scores across binary and scaled settings. Second, we introduce a multimodal data augmentation pipeline that generates 2,479 counterfactually neutral memes by isolating and rewriting the hateful modality. This pipeline, powered by a multi-agent LLM-VLM setup, reduces spurious correlations and improves classifier generalization. Our approach suggests new directions for building synthetic data to train robust and fair vision-language models. Our findings demonstrate that prompt structure and data composition are as critical as model size, and that targeted augmentation can support more trustworthy and context-sensitive hate detection.
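The modality-isolated rewriting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the judge and rewriter functions are hypothetical toy stand-ins for the multi-agent LLM/VLM calls, and the image modality is represented by a caption string for simplicity.

```python
from dataclasses import dataclass

@dataclass
class Meme:
    text: str
    image_caption: str  # textual stand-in for the visual modality


# Toy stand-ins (assumptions): a real pipeline would query LLM/VLM agents.
def modality_is_hateful(content: str) -> bool:
    # toy keyword judge; the paper's setup uses collaborative agent reasoning
    return "hate" in content.lower()


def rewrite_neutral(content: str) -> str:
    # toy rewriter; a real agent would generate a fluent neutral rewrite
    return content.lower().replace("hate", "like")


def neutralize(meme: Meme) -> Meme:
    """Isolate the hateful modality and rewrite only that side,
    leaving the benign modality untouched to form a counterfactual pair."""
    text = rewrite_neutral(meme.text) if modality_is_hateful(meme.text) else meme.text
    caption = (
        rewrite_neutral(meme.image_caption)
        if modality_is_hateful(meme.image_caption)
        else meme.image_caption
    )
    return Meme(text=text, image_caption=caption)


sample = Meme(text="I hate group X", image_caption="a crowd at a rally")
neutral = neutralize(sample)
print(neutral.text)           # only the hateful text was rewritten
print(neutral.image_caption)  # the benign caption is unchanged
```

Because only the offending modality changes, each generated sample differs from its source meme in exactly one channel, which is what lets the classifier unlearn spurious text-image correlations.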