FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Current vision-language models (VLMs) for hateful meme detection rely heavily on dataset-specific heuristics and struggle to separate the causal contribution of rhetorical mechanisms from that of targeted groups, which leads to poor generalization. To address this limitation, this work introduces FBHM, the first functional benchmark for hateful memes, built as an orthogonal combination of 25 rhetorical functions and 10 target groups (5,000 memes in total). The authors also propose Learnable Steering Vectors (LSV), a method that applies a causal intervention objective and substantially improves model robustness with only 500 steering samples drawn from 50 unique base memes. Experiments show that state-of-the-art VLMs drop to near-random performance on FBHM, whereas LSV improves Macro-F1 by approximately 30 points without degrading source-domain performance, outperforming both in-context learning and parameter-efficient fine-tuning (PEFT).
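Because the headline numbers are reported in Macro-F1, it may help to see what "near-random" looks like under that metric. The following is a minimal sketch of the standard Macro-F1 definition (the unweighted mean of per-class F1); the toy labels and the names `always_benign` and `alternating` are invented for illustration and are not taken from FBHM.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy balanced set (invented): 1 = hateful, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
always_benign = [0] * 8                   # degenerate single-class predictor
alternating = [1, 0, 1, 0, 1, 0, 1, 0]    # right on exactly half of each class
perfect = list(y_true)

for name, pred in [("always_benign", always_benign),
                   ("alternating", alternating),
                   ("perfect", perfect)]:
    print(name, round(macro_f1(y_true, pred), 3))
```

On a balanced binary set, a predictor that is right on half of each class scores 0.5, and one that collapses to a single class scores lower still, so Macro-F1 penalizes models that ignore one class.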

📝 Abstract

Hateful meme detection remains a formidable challenge for vision-language models (VLMs), because existing benchmarks are structurally observational: they confound rhetorical hate mechanisms with target community features and so prevent causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models that are highly accurate on standard datasets drop to near-random performance on FBHM, indicating that they exploit dataset-specific heuristics rather than robust multimodal reasoning. To close this gap efficiently, we propose LSV (learnable steering vectors), a strategy for the ultra-low-data regime that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points and outperforming in-context learning and parameter-efficient fine-tuning (PEFT) without degrading source-domain performance.
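The two-axis design described above can be made concrete with a little arithmetic. The sketch below enumerates the functionality-by-target grid using placeholder names (`function_00`, `target_00`, and so on are hypothetical, not the benchmark's actual labels). The per-cell count assumes memes are spread evenly across cells, which the abstract does not state.

```python
from itertools import product

# Figures taken from the abstract.
N_FUNCTIONS = 25        # rhetorical functionalities
N_TARGETS = 10          # target communities
TOTAL_MEMES = 5000
STEERING_SAMPLES = 500
BASE_MEMES = 50

# Placeholder axis labels; the real category names are not given here.
functions = [f"function_{i:02d}" for i in range(N_FUNCTIONS)]
targets = [f"target_{j:02d}" for j in range(N_TARGETS)]

# Every functionality is crossed with every target community.
cells = list(product(functions, targets))

# ASSUMPTION: uniform coverage. The abstract gives only the totals.
memes_per_cell = TOTAL_MEMES / len(cells)

# Average number of steering samples derived from each base meme.
samples_per_base = STEERING_SAMPLES / BASE_MEMES

print(len(cells), memes_per_cell, samples_per_base)
```

Under the uniform-coverage assumption this gives 250 cells with 20 memes each, and the 500 steering samples work out to an average of 10 per base meme.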

Problem

Research questions and friction points this paper is trying to address.

hateful meme detection

vision-language models

benchmarking

causal evaluation

rhetorical functionalities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Functional Benchmarking

Causal Intervention

Learnable Steering Vectors
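
To illustrate the general idea behind a learnable steering vector, here is a toy sketch in plain Python: a frozen classifier head is left untouched, and only an additive vector `v` applied to the hidden representation is trained. This is not the paper's LSV method. The causal intervention objective is not specified in this summary, so the sketch substitutes ordinary binary cross-entropy, and the weights, bias, and data points are all invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Frozen toy classifier head (hypothetical values). The bias is deliberately
# miscalibrated so that the head under-predicts the positive class.
W = [1.0, -1.0]
B = -2.0

def predict(h, v):
    """Probability of the positive class after steering: h' = h + v."""
    steered = [hi + vi for hi, vi in zip(h, v)]
    return sigmoid(sum(wi * si for wi, si in zip(W, steered)) + B)

def accuracy(data, v):
    return sum((predict(h, v) >= 0.5) == bool(y) for h, y in data) / len(data)

def train_steering_vector(data, lr=0.5, epochs=200):
    """Gradient descent on binary cross-entropy; only v is updated."""
    v = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for h, y in data:
            err = predict(h, v) - y          # d(BCE)/d(logit)
            for k in range(len(v)):
                grad[k] += err * W[k] / len(data)   # d(logit)/d(v_k) = W[k]
        v = [vk - lr * gk for vk, gk in zip(v, grad)]
    return v

# Invented "hidden states" with labels (1 = hateful, 0 = benign).
data = [([1.0, 0.0], 1), ([0.8, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]

before = accuracy(data, [0.0, 0.0])
v = train_steering_vector(data)
after = accuracy(data, v)
print(before, after, v)
```

In this linear toy the learned vector only shifts the decision threshold along the direction of `W`, so it behaves like a bias correction. In an actual VLM the vector would typically be added to an intermediate layer's activations, where its effect passes through later nonlinear layers. On the four toy points the accuracy rises from 0.5 to 1.0 while `W` and `B` stay fixed.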