On VLMs for Diverse Tasks in Multimodal Meme Classification

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) show limited performance on multi-subtask meme classification (sarcasm, offensiveness, and sentiment) due to modality fragmentation and insufficient cross-modal reasoning. Method: The paper proposes a synergistic "VLM image understanding + LLM text understanding" framework. It introduces a fine-grained meme explanation distillation mechanism that transfers the VLM's cross-modal reasoning to a lightweight LLM, constructs a benchmark of prompting strategies tailored to each sub-task, and systematically evaluates LoRA fine-tuning gains across the VLM encoder, projector, and decoder components. Contribution/Results: Experiments show accuracy improvements of 8.34%, 3.52%, and 26.24% on sarcasm, offensiveness, and sentiment classification, respectively, outperforming both standalone VLM and text-only baselines and effectively mitigating modality isolation in multimodal meme understanding.
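The distillation mechanism described above can be sketched as a simple data-preparation step: the VLM's detailed explanation of the meme image is packed together with the embedded meme text into prompt/target pairs for fine-tuning a smaller LLM. The function and field names below are illustrative assumptions, not taken from the paper's released code.

```python
# Hypothetical sketch of the "VLM explains, LLM classifies" distillation
# pipeline. Names and prompt wording are assumptions for illustration.

def build_distillation_example(meme_text, vlm_explanation, label):
    """Combine the VLM's image-level explanation with the embedded meme
    text into a single prompt/target pair for LLM fine-tuning."""
    prompt = (
        "Image explanation: " + vlm_explanation + "\n"
        "Meme text: " + meme_text + "\n"
        "Classify the meme for the given sub-task "
        "(sarcasm / offensiveness / sentiment):"
    )
    return {"prompt": prompt, "target": label}

example = build_distillation_example(
    meme_text="when the wifi drops for one second",
    vlm_explanation="A person screaming in exaggerated panic.",
    label="sarcastic",
)
print(example["prompt"])
```

In this setup the lightweight LLM never sees the image; it learns the cross-modal signal only through the VLM-generated explanation, which is the transfer the summary refers to.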

📝 Abstract
In this paper, we present a comprehensive and systematic analysis of vision-language models (VLMs) for disparate meme classification tasks. We introduce a novel approach that generates a VLM-based understanding of meme images and fine-tunes LLMs on the textual understanding of the embedded meme text to improve performance. Our contributions are threefold: (1) benchmarking VLMs with diverse prompting strategies tailored to each sub-task; (2) evaluating LoRA fine-tuning across all VLM components to assess performance gains; and (3) proposing a novel approach in which detailed meme interpretations generated by VLMs are used to train smaller language models (LLMs), significantly improving classification. Combining VLMs with LLMs improved the baseline performance by 8.34%, 3.52% and 26.24% for sarcasm, offensive, and sentiment classification, respectively. Our results reveal the strengths and limitations of VLMs and present a novel strategy for meme understanding.
Problem

Research questions and friction points this paper is trying to address.

Analyzing VLMs for diverse meme classification tasks
Improving performance via VLM-based meme and text understanding
Combining VLMs with LLMs to enhance classification accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLMs analyze meme images with diverse prompting
LoRA fine-tuning enhances VLM components
VLMs train smaller LLMs for better classification
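The LoRA fine-tuning evaluated across the VLM encoder, projector, and decoder follows the standard low-rank adaptation formulation: a frozen weight matrix W is augmented with a trainable low-rank product B·A scaled by alpha/r, so only the small A and B matrices are updated. A minimal NumPy sketch of that update, assuming the standard zero-initialisation of B:

```python
# Minimal sketch of a LoRA-adapted linear layer (standard formulation):
# the frozen weight W gains a low-rank update (alpha / r) * B @ A.
# Only A and B would be trained, which is why adapting individual VLM
# components (encoder, projector, decoder) is cheap.
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    r = A.shape[0]                      # low rank, r << min(W.shape)
    delta = B @ A                       # low-rank update, same shape as W
    return x @ (W + (alpha / r) * delta).T

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable
B = np.zeros((d_out, r))                # trainable; zero-init => no-op at start

x = rng.standard_normal((1, d_in))
# With B zero-initialised, the adapted layer equals the frozen baseline.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

The zero-initialised B guarantees fine-tuning starts exactly from the pretrained model, so any per-component gain the paper measures is attributable to the learned low-rank update.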
Deepesh Gavit
Department of Data Science and Engineering, Indian Institute of Science Education and Research, Bhopal, India
Debajyoti Mazumder
Indian Institute of Science Education and Research Bhopal
NLP
Samiran Das
Department of Data Science and Engineering, Indian Institute of Science Education and Research, Bhopal, India
Jasabanta Patro
Assistant Professor, DSE, IISER Bhopal
NLP, Social Computing