Exploring Vision Language Models for Multimodal and Multilingual Stance Detection

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates mainstream vision-language models (VLMs) on multimodal stance detection across seven languages, focusing on text-image synergy, cross-lingual generalization, and modality utilization preferences. Leveraging an extended multilingual image-text stance dataset, we conduct zero-shot and fine-tuned experiments, complemented by attribution analysis and cross-lingual comparison. We uncover a pervasive “text-over-image” bias in VLMs, most notably a strong reliance on text embedded in the images (OCR content), while observing high inter-language prediction consistency that is uncorrelated with explicit multilingual capability; certain compact models also exhibit anomalous macro-F1 deviations. Our core contributions are threefold: (1) the first behavior-oriented evaluation framework for VLMs in multilingual stance detection; (2) empirical identification of the mechanism underlying imbalanced modality utilization; and (3) theoretically grounded insights and actionable directions for building trustworthy multimodal social computing systems.
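As a concrete illustration of the zero-shot setup the summary describes, here is a minimal sketch of a single stance query against a VLM. It assumes an OpenAI-compatible chat endpoint; the model name, label set, and prompt wording are illustrative choices, not the paper's actual protocol.

```python
# Minimal zero-shot stance query against an OpenAI-compatible VLM endpoint.
# Hypothetical sketch: model name, labels, and prompt are NOT from the paper.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["favor", "against", "neutral"]  # assumed label set

def classify_stance(image_path: str, post_text: str, target: str) -> str:
    """Zero-shot stance prediction for one image-text post (illustrative)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    prompt = (
        f"A social media post about the target '{target}' consists of the "
        f"text below plus the attached image.\n"
        f"Text: {post_text}\n"
        f"Answer with exactly one word from {LABELS}: what stance does the "
        f"post take toward the target?"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper evaluates several VLMs
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=5,
        temperature=0.0,  # near-deterministic decoding for evaluation
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "neutral"  # crude fallback
```

A study of this kind would repeat such a query with the image omitted or the text omitted to probe modality reliance, and across parallel posts in each of the seven languages to probe prediction consistency.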

📝 Abstract
Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than on images for stance detection, and this trend persists across languages. Additionally, VLMs rely significantly more on text contained within images than on other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether or not they are explicitly multilingual, although there are outliers whose behavior is incongruous with macro F1, language support, and model size.
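The two headline findings (text-over-image reliance and cross-language prediction consistency) map onto two simple measurements. Below is a minimal sketch of both, assuming gold labels and per-condition predictions have already been collected as Python lists; the data layout, condition names, and label set are assumptions, not the authors' code.

```python
# Sketch of two measurements suggested by the abstract: per-modality macro F1
# (text-only vs. image-only vs. text+image) and cross-language prediction
# agreement. Data layout and names are illustrative assumptions.
from itertools import combinations
from sklearn.metrics import f1_score

def modality_ablation(gold, preds_by_condition):
    """preds_by_condition maps e.g. 'text_only' -> list of predicted labels."""
    return {cond: f1_score(gold, preds, average="macro")
            for cond, preds in preds_by_condition.items()}

def cross_lingual_agreement(preds_by_language):
    """Mean pairwise agreement of one model's predictions across languages,
    computed on parallel (translated) instances of the same posts."""
    rates = []
    for lang_a, lang_b in combinations(preds_by_language, 2):
        pa, pb = preds_by_language[lang_a], preds_by_language[lang_b]
        rates.append(sum(x == y for x, y in zip(pa, pb)) / len(pa))
    return sum(rates) / len(rates)

# Toy example: text_only scoring close to text_image while image_only lags
# would reflect the "text-over-image" reliance the paper reports.
gold = ["favor", "against", "neutral", "favor"]
scores = modality_ablation(gold, {
    "text_only":  ["favor", "against", "neutral", "against"],
    "image_only": ["neutral", "neutral", "favor", "neutral"],
    "text_image": ["favor", "against", "neutral", "favor"],
})
consistency = cross_lingual_agreement({
    "en": ["favor", "against", "neutral", "favor"],
    "de": ["favor", "against", "neutral", "against"],
    "zh": ["favor", "against", "favor", "favor"],
})
print(scores)       # per-condition macro F1
print(consistency)  # mean pairwise agreement across languages
```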
Problem

Research questions and friction points this paper is trying to address.

Multimodal Stance Detection
Cross-lingual Performance
Image-Text Synergy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Stance Detection
Cross-lingual Performance
Image-Embedded Text Recognition
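The "Image-Embedded Text Recognition" tag reflects the finding that VLMs lean heavily on text inside images. One way to probe this, sketched below, is to mask detected text regions and re-run the stance query on the masked image; pytesseract and the confidence threshold are illustrative choices here, not the paper's stated method.

```python
# Hypothetical OCR-masking probe for in-image text reliance: paint over
# detected text regions, then compare the VLM's predictions on the original
# vs. the masked image. Requires the Tesseract binary to be installed.
import pytesseract
from PIL import Image, ImageDraw

def mask_image_text(image_path: str, out_path: str) -> None:
    """Cover OCR-detected text boxes so only non-text visual content remains."""
    img = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    draw = ImageDraw.Draw(img)
    for i, word in enumerate(data["text"]):
        # Skip empty detections and low-confidence boxes (threshold assumed).
        if not word.strip() or float(data["conf"][i]) < 50:
            continue
        box = (data["left"][i], data["top"][i],
               data["left"][i] + data["width"][i],
               data["top"][i] + data["height"][i])
        draw.rectangle(box, fill=(127, 127, 127))
    img.save(out_path)
```

A model whose predictions change sharply once in-image text is masked is drawing its stance signal from OCR content rather than from broader visual evidence.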