Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the identification and characterization of multimodal sexism in internet memes and short-form videos. It proposes a feature-engineered late-fusion framework built on gradient-boosted regression models with hierarchical post-processing. For memes, the framework combines visual, textual, demographic, and biometric features with high-level semantic indicators extracted by large language models (LLMs); for videos, it examines feature selection, frame-based visual representations, OCR-based text, acoustic descriptors, and sensor-derived metadata. On development data, focused LLM-derived semantic features improve sexism identification in memes, while video performance is highly sensitive to feature dimensionality and cross-modal noise. Compact feature selection performs best for videos on development data, but on the official test set the unfiltered representation generalizes better, which points to the need for more robust temporal modeling in noisy short-form video.

📝 Abstract

We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.
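The idea of late fusion followed by hierarchical post-processing can be illustrated with a minimal sketch. It is not the authors' implementation: the per-modality scores stand in for the outputs of trained gradient-boosted regressors, the weighted average is one simple fusion rule chosen for illustration, and the function names, modality names, category labels, weights, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def late_fusion(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-modality scores in [0, 1] with a weighted average.

    Assumption: each score would come from a separate regressor trained
    on one modality; here the scores are simply passed in as numbers.
    """
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(weights[name] * score for name, score in scores.items())
    return weighted_sum / total_weight


def hierarchical_postprocess(
    sexist_score: float,
    category_scores: Dict[str, float],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> Tuple[bool, List[str]]:
    """Make fine-grained labels consistent with the binary decision.

    If the item is judged not sexist, no category is returned. If it is
    judged sexist, every category at or above the threshold is kept, and
    the top-scoring category is used as a fallback when none qualifies.
    """
    if sexist_score < threshold:
        return False, []
    chosen = [name for name, score in category_scores.items() if score >= threshold]
    if not chosen:
        chosen = [max(category_scores, key=category_scores.get)]
    return True, sorted(chosen)


if __name__ == "__main__":
    fused = late_fusion(
        {"visual": 0.4, "text": 0.8},   # hypothetical modality scores
        {"visual": 1.0, "text": 3.0},   # hypothetical fusion weights
    )
    print(hierarchical_postprocess(fused, {"stereotyping": 0.3, "objectification": 0.45}))
```

The post-processing step prevents contradictory outputs, such as a meme labeled "not sexist" that still carries a sexism category, which is the kind of consistency a hierarchical scheme is meant to enforce.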

Problem

Research questions and friction points this paper is trying to address.

multimodal sexism

memes

short-form videos

sexism identification

gender bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal sexism detection

LLM-derived semantic features

gradient boosting