Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

📅 2025-04-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the detection and mitigation of hateful multimodal memes: image-text composites used to spread hate on social platforms. Methodologically, the authors (1) introduce a definition-guided prompting paradigm that sharpens fine-grained recognition of multimodal hateful content by vision-language models (e.g., LLaVA, Gemini, GPT-4o), and (2) propose UnHateMeme, the first end-to-end framework for meme hate mitigation, which performs controlled substitution of textual and/or visual elements while preserving semantic validity and cross-modal consistency. Experiments show that the detection approach significantly outperforms state-of-the-art baselines. In human evaluation, 100% of mitigated memes are judged non-hateful, and over 92% retain image-text coherence. The work establishes a deployable, interpretable paradigm for multimodal content safety, balancing mitigation efficacy with structural fidelity.
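The definition-guided prompting idea can be sketched as prepending an explicit hate-speech definition to the model's instruction before classification. The definition wording, function names, and message format below are illustrative assumptions, not the paper's verbatim prompt:

```python
# Hypothetical sketch of definition-guided prompting. The definition
# text and prompt layout are assumptions for illustration only.

HATE_DEFINITION = (
    "Hate speech is content that attacks or demeans a person or group "
    "on the basis of attributes such as race, religion, ethnicity, "
    "gender, disability, or sexual orientation."
)

def build_detection_prompt(meme_text: str) -> str:
    """Compose a definition-guided instruction for a vision-language model.

    The meme image would be attached separately through the VLM's image
    input; this helper only builds the textual part of the prompt.
    """
    return (
        f"Definition: {HATE_DEFINITION}\n\n"
        "Task: Given the meme image and its overlaid text, decide whether "
        "the meme is hateful under the definition above. Answer with "
        "exactly one word: 'hateful' or 'non-hateful'.\n\n"
        f'Meme text: "{meme_text}"'
    )

prompt = build_detection_prompt("example overlaid caption")
```

The resulting string would be sent alongside the meme image to a VLM such as LLaVA, Gemini, or GPT-4o; grounding the decision in an explicit definition is what the paper credits for the improved fine-grained detection.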

📝 Abstract
The rapid evolution of social media has provided enhanced communication channels for individuals to create online content, enabling them to express their thoughts and opinions. Multimodal memes, often utilized for playful or humorous expression with visual and textual elements, are sometimes misused to disseminate hate speech against individuals or groups. While the detection of hateful memes is well-researched, developing effective methods to transform hateful content in memes remains a significant challenge. Leveraging the powerful generation and reasoning capabilities of Vision-Language Models (VLMs), we address the tasks of detecting and mitigating hateful content. This paper presents two key contributions: first, a definition-guided prompting technique for detecting hateful memes, and second, a unified framework for mitigating hateful content in memes, named UnHateMeme, which works by replacing hateful textual and/or visual components. With our definition-guided prompts, VLMs achieve impressive performance on the hateful meme detection task. Furthermore, our UnHateMeme framework, integrated with VLMs, demonstrates a strong capability to convert hateful memes into non-hateful forms that meet human-level criteria for hate speech and maintain multimodal coherence between image and text. Through empirical experiments, we show the effectiveness of state-of-the-art pretrained VLMs such as LLaVA, Gemini, and GPT-4o on the proposed tasks, providing a comprehensive analysis of their respective strengths and limitations. This paper aims to shed light on important applications of VLMs for ensuring safe and respectful online environments.
Problem

Research questions and friction points this paper is trying to address.

Detecting hateful content in multimodal memes using VLMs
Mitigating hate speech in memes via visual-textual component replacement
Ensuring multimodal coherence while converting hateful to non-hateful memes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Definition-guided prompting for hateful meme detection
UnHateMeme framework for content mitigation
Integration of VLMs like LLaVA, Gemini, GPT-4o
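The mitigation contribution above can be sketched as a routing pipeline: decide which modality carries the hate, then substitute only that component while keeping the other intact. The data model, function names, and the trivial keyword rule below are illustrative assumptions standing in for VLM calls, not the paper's implementation:

```python
# Hypothetical skeleton of an UnHateMeme-style mitigation pipeline.
# All names and the routing rule are illustrative; real components
# would be backed by VLM calls (e.g., LLaVA, Gemini, GPT-4o).
from dataclasses import dataclass

@dataclass
class Meme:
    text: str
    image_id: str  # stand-in for actual image data

def classify_hate_source(meme: Meme) -> str:
    """Stub: a VLM would judge whether hate arises from the text, the
    image, or their combination. A keyword rule is used here purely
    so the sketch runs."""
    return "text" if "hateful" in meme.text.lower() else "none"

def rewrite_text(meme: Meme) -> Meme:
    """Stub: a VLM would generate replacement text that is non-hateful
    yet stays coherent with the unchanged image."""
    return Meme(text="[non-hateful replacement text]", image_id=meme.image_id)

def mitigate(meme: Meme) -> Meme:
    """Route the meme to the appropriate substitution step."""
    if classify_hate_source(meme) == "text":
        return rewrite_text(meme)
    # "image" and "both" branches would swap or regenerate the visual
    # component; omitted in this sketch.
    return meme

clean = mitigate(Meme(text="some hateful caption", image_id="img_001"))
```

The design point the paper emphasizes is that substitution is targeted: replacing only the offending modality is what lets the mitigated meme retain image-text coherence rather than being regenerated wholesale.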