🤖 AI Summary
Detecting multimodal hate speech, particularly when it is conveyed implicitly through humor or irony in image-text interactions, remains challenging due to the subtlety of malicious semantics and the limitations of existing vision-language models, which suffer from coarse-grained supervision and vulnerability to implicit biases. To address this, we propose a structured prompt optimization framework coupled with a multi-agent counterfactual data augmentation pipeline. Our approach integrates fine-grained annotation, modality-isolated rewriting, and collaborative reasoning to sharpen model discrimination of nuanced hateful intent. Applied to InternVL2, our method achieves state-of-the-art F1 scores on both binary and hierarchical hate detection tasks. We additionally generate 2,479 neutral counterfactual samples, which mitigate spurious correlations while improving generalization and fairness. This work establishes an interpretable and robust paradigm for implicit multimodal hate speech detection.
📝 Abstract
The modern web is saturated with multimodal content, intensifying the challenge of detecting hateful memes, where harmful intent is often conveyed through subtle interactions between text and image under the guise of humor or satire. While recent advances in Vision-Language Models (VLMs) show promise, these models lack fine-grained supervision and remain susceptible to implicitly conveyed hate. In this paper, we present a dual-pronged approach to improve multimodal hate detection. First, we propose a prompt optimization framework that systematically varies prompt structure, supervision granularity, and training modality. We show that both prompt design and label scaling influence performance: structured prompts improve robustness even in small models, and InternVL2 achieves the best F1 scores across binary and scaled settings. Second, we introduce a multimodal data augmentation pipeline that generates 2,479 counterfactually neutral memes by isolating and rewriting the hateful modality. This pipeline, powered by a multi-agent LLM-VLM setup, reduces spurious correlations and improves classifier generalization. Our approach suggests new directions for building synthetic data to train robust and fair vision-language models. Our findings demonstrate that prompt structure and data composition are as critical as model size, and that targeted augmentation can support more trustworthy and context-sensitive hate detection.
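The modality-isolated rewriting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the judge and rewriter functions are hypothetical toy stand-ins for the multi-agent LLM/VLM calls, and the image modality is represented by a caption string for simplicity.

```python
from dataclasses import dataclass

@dataclass
class Meme:
    text: str
    image_caption: str  # textual stand-in for the visual modality


# Toy stand-ins (assumptions): a real pipeline would query LLM/VLM agents.
def modality_is_hateful(content: str) -> bool:
    # toy keyword judge; the paper's setup uses collaborative agent reasoning
    return "hate" in content.lower()


def rewrite_neutral(content: str) -> str:
    # toy rewriter; a real agent would generate a fluent neutral rewrite
    return content.lower().replace("hate", "like")


def neutralize(meme: Meme) -> Meme:
    """Isolate the hateful modality and rewrite only that side,
    leaving the benign modality untouched to form a counterfactual pair."""
    text = rewrite_neutral(meme.text) if modality_is_hateful(meme.text) else meme.text
    caption = (
        rewrite_neutral(meme.image_caption)
        if modality_is_hateful(meme.image_caption)
        else meme.image_caption
    )
    return Meme(text=text, image_caption=caption)


sample = Meme(text="I hate group X", image_caption="a crowd at a rally")
neutral = neutralize(sample)
print(neutral.text)           # only the hateful text was rewritten
print(neutral.image_caption)  # the benign caption is unchanged
```

Because only the offending modality changes, each generated sample differs from its source meme in exactly one channel, which is what lets the classifier unlearn spurious text-image correlations.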