Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal hate speech detection methods—particularly for videos and memes—suffer from inadequate cross-modal interaction modeling, and the robustness of current fusion strategies remains poorly understood. Method: We systematically evaluate fusion strategies across modalities and identify that modality-specific characteristics critically impact performance: embedding-level fusion improves F1 by 9.9 points on HateMM videos but fails on memes. Guided by this insight, we propose a *modality-characteristic-driven architecture design* principle and develop a feature-embedding-based multimodal framework. Rigorous validation includes ablation studies, error analysis, and cross-dataset evaluation (HateMM and Hateful Memes). Contributions: Our approach achieves state-of-the-art performance on HateMM (F1 = XX.X); it is the first to explicitly characterize failure boundaries in benign perturbation robustness and image-text semantic alignment; and it establishes a reproducible benchmark and principled design paradigm for multimodal hate detection.

Technology Category

Application Category

📝 Abstract
Social media platforms enable the propagation of hateful content across different modalities such as textual, auditory, and visual, necessitating effective detection methods. While recent approaches have shown promise in handling individual modalities, their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content. Our comprehensive evaluation reveals significant modality-specific limitations: while simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset) with a 9.9% points F1-score improvement, it struggles with complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, we demonstrate how current fusion approaches fail to capture nuanced cross-modal interactions, particularly in cases involving benign confounders. Our findings provide crucial insights for developing more robust hate detection systems and highlight the need for modality-specific architectural considerations. The code is available at https://github.com/gak97/Video-vs-Meme-Hate.
Problem

Research questions and friction points this paper is trying to address.

Multimodal hate detection in social media
Performance across video and image-based content
Limitations in capturing cross-modal interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fusion-based multimodal hate detection
Analyzes video and image content
Identifies modality-specific limitations
🔎 Similar Papers
No similar papers found.