🤖 AI Summary
Image sentiment analysis has traditionally relied solely on visual features, neglecting the systematic modeling and integration of multimodal metadata—such as textual descriptions and keyword tags—that convey rich semantic and affective cues.
Method: We propose the first metadata-augmented Transformer framework that unifies visual features with heterogeneous textual metadata (descriptions and tags). Our approach introduces an adaptive relevance learning module to dynamically weight the importance of diverse metadata sources and a cross-modal fusion module for fine-grained sentiment prediction. Additionally, we incorporate a multimodal representation alignment mechanism to enhance semantic consistency across modalities.
Contribution/Results: Evaluated on three public benchmark datasets, our method achieves significant improvements over existing state-of-the-art approaches. The results demonstrate that explicitly modeling multimodal metadata substantially improves both performance and robustness in image sentiment understanding, establishing its critical role in advancing affective vision systems.
📝 Abstract
As growing numbers of internet users post images online to express their daily emotions, image sentiment analysis has attracted increasing attention. Recent work has generally designed various neural networks to extract visual features from images for sentiment analysis. Despite this significant progress, metadata, i.e., the data that describes an image (e.g., text descriptions and keyword tags), has not been sufficiently explored in this task. In this paper, we propose a novel Metadata Enhanced Transformer for sentiment analysis (SentiFormer) that fuses multiple types of metadata and the corresponding image into a unified framework. Specifically, we first obtain multiple metadata of the image and unify the representations of these diverse data. To adaptively learn an appropriate weight for each metadata type, we then design an adaptive relevance learning module that highlights more effective information while suppressing less relevant information. Moreover, we develop a cross-modal fusion module to fuse the adaptively weighted representations and make the final prediction. Extensive experiments on three publicly available datasets demonstrate the superiority and rationality of our proposed method.
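The pipeline described above (unify the representations of each metadata source, adaptively weight them against the image, then fuse for prediction) can be sketched in plain Python. This is a minimal illustrative sketch, not the paper's architecture: the dot-product relevance score, softmax normalization, and additive fusion are all assumptions standing in for the learned Transformer modules.

```python
import math

def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_relevance_weights(image_feat, metadata_feats):
    """Stand-in for the adaptive relevance learning module: score each
    metadata embedding against the image embedding, then normalize so
    more relevant metadata sources receive larger weights."""
    scores = [dot(image_feat, m) for m in metadata_feats]
    return softmax(scores)

def fuse(image_feat, metadata_feats, weights):
    """Stand-in for the cross-modal fusion module: add the
    relevance-weighted sum of metadata embeddings to the image
    embedding (residual-style combination)."""
    fused = list(image_feat)
    for w, m in zip(weights, metadata_feats):
        for i, x in enumerate(m):
            fused[i] += w * x
    return fused

# Toy example: one image embedding plus two metadata embeddings
# (e.g., a text description and a tag set), all 3-dimensional.
image = [1.0, 0.0, 0.5]
description = [0.9, 0.1, 0.4]   # closely aligned with the image
tags = [-0.2, 1.0, 0.0]         # weakly aligned with the image

weights = adaptive_relevance_weights(image, [description, tags])
fused = fuse(image, [description, tags], weights)
```

In this toy setting the description embedding scores higher against the image than the tag embedding, so it receives the larger weight; in the actual method these weights are learned rather than computed from a fixed similarity.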