🤖 AI Summary
Social media memes, multimodal artifacts that combine images and text with rich cultural metaphors, pose significant challenges for hate speech detection because their meaning is implicit and context-dependent. To address this, we propose CAMU, a context-enhanced multimodal detection framework with two main components: (1) parameter-efficient fine-tuning of the CLIP text encoder, in which selectively tuning the deeper layers adapts textual representations to hate-related semantics; and (2) generative image captioning by large vision-language models, paired with a caption-scoring network that highlights hate-relevant content. On the Hateful Memes benchmark, our method achieves 0.807 accuracy and 0.806 F1-score, on par with the existing state-of-the-art framework while being much more efficient. It also generalizes to offensive meme identification, attaining the best F1-score of 0.673 on MultiOFF. Analyses on benign confounders indicate that robust visual grounding and nuanced text representations are crucial for reliable detection.
📝 Abstract
Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. We introduce a novel framework, CAMU, which leverages large vision-language models to generate more descriptive captions, a caption-scoring neural network to emphasise hate-relevant content, and parameter-efficient fine-tuning of CLIP's text encoder for improved multimodal understanding of memes. Experiments on publicly available hateful meme datasets show that simple projection layer fine-tuning yields modest gains, whereas selectively tuning deeper text encoder layers significantly boosts performance on all evaluation metrics. Moreover, our approach attains high accuracy (0.807) and F1-score (0.806) on the Hateful Memes dataset, on par with the existing SoTA framework while being much more efficient, offering practical advantages in real-world scenarios that rely on fixed decision thresholds. CAMU also achieves the best F1-score of 0.673 on the MultiOFF dataset for offensive meme identification, demonstrating its generalisability. Additional analyses on benign confounders reveal that robust visual grounding and nuanced text representations are crucial for reliable hate and offence detection. We will publicly release CAMU along with the resultant models for further research. Disclaimer: This paper includes references to potentially disturbing, hateful, or offensive content due to the nature of the task.
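The contrast the abstract draws between projection-only fine-tuning and selectively tuning deeper text encoder layers can be illustrated with a minimal sketch of the parameter-selection step. This is not the CAMU implementation: the parameter names are modelled on the Hugging Face `CLIPModel` layout, and the layer count (12), the number of tuned layers (`tune_last_k=4`), and the choice to tune both projection heads are assumptions made purely for illustration.

```python
import re

# Assumed naming scheme, modelled on Hugging Face's CLIPModel parameters.
# The abstract does not specify an implementation, so treat these as hypothetical.
TEXT_LAYER = re.compile(r"^text_model\.encoder\.layers\.(\d+)\.")
PROJECTION_PREFIXES = ("text_projection.", "visual_projection.")


def is_trainable(name: str, num_text_layers: int = 12, tune_last_k: int = 4) -> bool:
    """Decide whether a parameter should be updated during fine-tuning."""
    # Projection heads are tuned in both regimes (the projection-only baseline
    # corresponds to tune_last_k=0).
    if name.startswith(PROJECTION_PREFIXES):
        return True
    match = TEXT_LAYER.match(name)
    if match is None:
        # Vision tower, embeddings, and everything else stay frozen.
        return False
    # Only the deepest `tune_last_k` text encoder layers are unfrozen.
    return int(match.group(1)) >= num_text_layers - tune_last_k


# Illustrative parameter names (hypothetical, not read from a real checkpoint).
names = [
    "vision_model.encoder.layers.11.mlp.fc1.weight",
    "text_model.embeddings.token_embedding.weight",
    "text_model.encoder.layers.0.self_attn.q_proj.weight",
    "text_model.encoder.layers.7.mlp.fc1.weight",
    "text_model.encoder.layers.8.mlp.fc1.weight",
    "text_model.encoder.layers.11.layer_norm2.bias",
    "text_projection.weight",
]
trainable = [n for n in names if is_trainable(n)]
print(trainable)
```

With a PyTorch model, the same predicate could be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`, so that the optimizer only updates the selected layers.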