🤖 AI Summary
Hateful meme detection is hampered by the rapid evolution of memes, poor cross-domain generalization, and limited robustness in low-resource settings. To address these challenges, this paper proposes LMM-RGCL, a two-stage fine-tuning framework for large multimodal models built around a retrieval-guided contrastive learning mechanism. Stage I uses retrieval-augmented fine-tuning to improve domain adaptability; Stage II applies modality-aligned contrastive learning to sharpen semantic discriminability. Unlike conventional supervised fine-tuning, LMM-RGCL jointly optimizes for both in-domain accuracy and cross-domain generalization. Evaluated on six widely used meme classification benchmarks, it achieves state-of-the-art performance, outperforming agent-based systems such as VPD-PALI-X-55B, and it surpasses models like GPT-4o on out-of-domain memes under low-resource conditions, demonstrating strong robustness and generalizability in resource-constrained scenarios for hateful meme detection.
📝 Abstract
Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While large multimodal models have shown strong generalization across various tasks, they exhibit poor generalization to hateful meme detection due to the dynamic nature of memes tied to emerging social trends and breaking news. Recent work further highlights the limitations of conventional supervised fine-tuning for large multimodal models in this context. To address these challenges, we propose Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a novel two-stage fine-tuning framework designed to improve both in-domain accuracy and cross-domain generalization. Experimental results on six widely used meme classification datasets demonstrate that LMM-RGCL achieves state-of-the-art performance, outperforming agent-based systems such as VPD-PALI-X-55B. Furthermore, our method effectively generalizes to out-of-domain memes under low-resource settings, surpassing models like GPT-4o.
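Neither the summary nor the abstract spells out the retrieval-guided contrastive objective, so the following is only an illustrative sketch of the general idea the framework's name suggests: retrieve an example's nearest neighbors from a labeled memory bank, then treat same-label retrievals as positives and different-label retrievals as hard negatives in an InfoNCE-style loss. All function names, the positive/negative assignment rule, and the hyperparameters here are assumptions, not the authors' exact formulation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (plain Python lists).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rg_contrastive_loss(anchor, anchor_label, bank, bank_labels, k=3, tau=0.1):
    """Illustrative retrieval-guided contrastive loss (not the paper's exact loss).

    Retrieves the k most similar memory-bank embeddings for the anchor;
    same-label retrievals act as positives, different-label ones as hard
    negatives, combined in an InfoNCE-style ratio with temperature tau.
    Returns None if no positive was retrieved (the anchor would be skipped).
    """
    # Retrieval step: rank bank entries by similarity to the anchor, keep top-k.
    ranked = sorted(range(len(bank)), key=lambda i: -cosine(anchor, bank[i]))[:k]
    sims = [cosine(anchor, bank[i]) / tau for i in ranked]
    # Positives: retrieved neighbors sharing the anchor's label.
    pos = [math.exp(s) for s, i in zip(sims, ranked) if bank_labels[i] == anchor_label]
    if not pos:
        return None
    denom = sum(math.exp(s) for s in sims)
    return -math.log(sum(pos) / denom)

# Toy usage: a hateful anchor whose nearest retrieved neighbors are also
# hateful yields a near-zero loss; flipping the anchor's label makes the
# retrievals hard negatives and the loss grows.
bank = [[0.9, 0.1], [0.8, 0.2], [-1.0, 0.0]]
labels = [1, 1, 0]
low = rg_contrastive_loss([1.0, 0.0], 1, bank, labels)   # near zero
high = rg_contrastive_loss([1.0, 0.0], 0, bank, labels)  # much larger
```

The design choice sketched here, letting retrieval pick the contrastive pairs rather than sampling them randomly within a batch, is what would tie the contrastive signal to the retrieval component; the paper's actual pairing rule and loss may differ.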