ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Implicit hate speech detection in videos remains underexplored, hindered by the absence of large-scale, annotated multimodal benchmarks. Method: We introduce ImpliHateVid—the first large-scale multimodal dataset specifically designed for implicit hate speech in videos—and propose a two-stage contrastive learning framework. Stage one trains modality-specific encoders to extract features from audio, text, and visual content, augmented with sentiment, emotion, and caption-based features. Stage two trains cross-encoders with contrastive learning to refine joint multimodal representations, improving the modeling of complex implicit hateful semantics. Contribution/Results: Our approach integrates multimodal deep learning, cross-encoder training, and fine-grained semantic analysis. It achieves state-of-the-art performance on both ImpliHateVid and the existing HateMM benchmark, significantly outperforming prior baselines. Empirical results validate both the utility of the proposed dataset and the generalizability and effectiveness of the framework.

📝 Abstract
While existing research has primarily focused on text- and image-based hate speech detection, video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using a contrastive loss applied to the concatenated features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine the multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. We evaluate our method on two datasets: ImpliHateVid for implicit hate speech detection and HateMM for general hate speech detection in videos. The results demonstrate the effectiveness of the proposed multimodal contrastive learning framework for hateful content detection in videos and the significance of our dataset.
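The first stage described above (concatenating per-modality features and training with a contrastive loss) can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the paper's implementation: the encoders are stand-in random feature matrices, and the loss shown is a standard InfoNCE-style objective, which the abstract does not specify in detail.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(anchors, positives, temperature=0.1):
    """Illustrative InfoNCE contrastive loss (an assumption; the paper's
    exact loss is not given here). Each anchor's positive is the
    same-index row of `positives`; all other rows act as negatives."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

# Stage-1-style fusion: concatenate per-modality embeddings, then apply
# the contrastive objective to the fused representations. The audio/text/
# image matrices below are placeholders for real encoder outputs.
rng = np.random.default_rng(0)
audio, text, image = (rng.normal(size=(4, 8)) for _ in range(3))
fused = np.concatenate([audio, text, image], axis=1)   # (4, 24)
# Positives: a lightly perturbed view of the fused features.
loss = info_nce_loss(fused, fused + 0.01 * rng.normal(size=fused.shape))
```

In this sketch the loss is small when anchor and positive rows align and approaches log(N) when positives are uninformative, which is the usual behavior of InfoNCE-style objectives.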
Problem

Research questions and friction points this paper is trying to address.

Detecting implicit hate speech in videos using multimodal data
Addressing lack of large-scale video datasets for hate detection
Improving hate speech detection with contrastive learning framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage contrastive learning for video hate detection
Multimodal encoders for audio, text, image
Sentiment, emotion, caption features enhance detection
Mohammad Zia Ur Rehman
Indian Institute of Technology Indore, Indore, India
Anukriti Bhatnagar
Indian Institute of Technology Indore, Indore, India
Omkar Kabde
Chaitanya Bharathi Institute of Technology, Telangana, India
Shubhi Bansal
Prime Minister's Research Fellow (PMRF), Indian Institute of Technology, Indore
Natural Language Processing · Recommender Systems · Personalization · Data Mining · Information Retrieval
Nagendra Kumar
Indian Institute of Technology Indore, Indore, India