Do Multimodal LLMs See Sentiment?

📅 2025-08-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Emotion understanding in social media visual content faces challenges including complex scene-level semantics and poor generalization. This paper proposes MLLMsent, the first framework to systematically evaluate the emotional reasoning capabilities of multimodal large language models (MLLMs) from three complementary perspectives: direct sentiment classification from images, sentiment analysis of automatically generated image descriptions, and supervised fine-tuning on sentiment-labeled descriptions. The method combines automatic image captioning with pre-trained language models and is evaluated in a cross-dataset zero-shot setting, enabling strong generalization without target-domain training. On multiple benchmarks, MLLMsent achieves state-of-the-art performance, outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively. In the cross-dataset zero-shot setting, it surpasses the best runner-up, which was trained directly on the target data, by up to 8.26%, demonstrating MLLMs' significant potential and robustness for fine-grained visual sentiment understanding.
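As a concrete illustration of the caption-then-classify route described above, here is a minimal sketch assuming Hugging Face transformers, with BLIP standing in for the MLLM captioner and a Twitter-RoBERTa model as the sentiment classifier; these model choices and the `photo.jpg` input are illustrative assumptions, not the paper's exact setup.

```python
from transformers import pipeline

# Step 1: generate a textual description of the image
# (BLIP stands in for the MLLM captioner; the paper's exact model is not specified here).
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("photo.jpg")[0]["generated_text"]

# Step 2: run sentiment analysis on the generated description with a pre-trained model
# (a 3-way Twitter-RoBERTa classifier is an assumption, not the paper's choice).
classifier = pipeline("text-classification",
                      model="cardiffnlp/twitter-roberta-base-sentiment-latest")
result = classifier(caption)[0]

print(f"caption: {caption!r}")
print(f"sentiment: {result['label']} (score={result['score']:.3f})")
```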

Technology Category

Image Processing and Computer Vision
📝 Abstract
Understanding how visual content communicates sentiment is critical in an era where online interaction is increasingly dominated by this kind of media on social platforms. However, this remains a challenging problem, as sentiment perception is closely tied to complex, scene-level semantics. In this paper, we propose an original framework, MLLMsent, to investigate the sentiment reasoning capabilities of Multimodal Large Language Models (MLLMs) through three perspectives: (1) using those MLLMs for direct sentiment classification from images; (2) associating them with pre-trained LLMs for sentiment analysis on automatically generated image descriptions; and (3) fine-tuning the LLMs on sentiment-labeled image descriptions. Experiments on a recent and established benchmark demonstrate that our proposal, particularly the fine-tuned approach, achieves state-of-the-art results, outperforming Lexicon-, CNN-, and Transformer-based baselines by up to 30.9%, 64.8%, and 42.4%, respectively, across different levels of evaluators' agreement and sentiment polarity categories. Remarkably, in a cross-dataset test, without any training on these new data, our model still outperforms, by up to 8.26%, the best runner-up, which has been trained directly on them. These results highlight the potential of the proposed visual reasoning scheme for advancing affective computing, while also establishing new benchmarks for future research.
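A minimal sketch of perspective (1), direct sentiment classification from the image, assuming a LLaVA checkpoint served through Hugging Face transformers; the model choice, prompt wording, and `photo.jpg` input are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# LLaVA-1.5 stands in for the MLLM under test; the paper evaluates MLLMs generally.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

# Ask the model for a one-word polarity label directly from the image.
prompt = ("USER: <image>\nClassify the overall sentiment conveyed by this image as "
          "positive, neutral, or negative. Answer with a single word.\nASSISTANT:")
image = Image.open("photo.jpg")  # placeholder input

inputs = processor(text=prompt, images=image,
                   return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=5)
print(processor.decode(output[0], skip_special_tokens=True))
```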
Problem

Research questions and friction points this paper is trying to address.

Investigating sentiment reasoning in Multimodal LLMs
Evaluating visual sentiment classification from images
Improving sentiment analysis on image descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLMsent framework for sentiment classification
Fine-tuning LLMs on sentiment-labeled descriptions (see the sketch after this list)
Cross-dataset testing without additional training
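A minimal sketch of the fine-tuning idea referenced above, assuming a RoBERTa sequence classifier trained on MLLM-generated captions with 3-way polarity labels; the base model, the toy examples, and the hyperparameters are all illustrative assumptions, not the paper's reported setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy stand-in for sentiment-labeled image descriptions (0=negative, 1=neutral,
# 2=positive); in the paper these descriptions come from MLLM captioning.
train = Dataset.from_dict({
    "text": ["a child laughing at a birthday party",
             "a quiet street on a cloudy afternoon",
             "wreckage left behind after the flood"],
    "label": [2, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
train = train.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                  batched=True)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                           num_labels=3)
args = TrainingArguments(output_dir="mllmsent-finetune",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train,
        tokenizer=tokenizer).train()
```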
👥 Authors
Neemias B. da Silva
Universidade Tecnológica Federal do Paraná (UTFPR), Brazil
John Harrison
Universidade Tecnológica Federal do Paraná (UTFPR), Brazil
Rodrigo Minetto
Universidade Tecnológica Federal do Paraná (UTFPR), DAINF, Curitiba, Brazil
Myriam R. Delgado
Universidade Tecnológica Federal do Paraná (UTFPR), Brazil
Bogdan T. Nassu
Universidade Tecnológica Federal do Paraná (UTFPR), Brazil
Thiago H. Silva
University of Toronto, Canada