🤖 AI Summary
This work addresses two key challenges in multimodal aspect-based sentiment analysis (MABSA): insufficient fusion of textual and visual information, and inadequate modeling of fine-grained cross-modal interactions. To this end, we propose an adaptive cross-modal attention mechanism that jointly performs context-aware dynamic modality weighting and adaptive cross-attention, enabling precise alignment and complementary enhancement between text and image features and thereby improving both aspect term extraction and sentiment classification. Our primary contribution is the first integration of context-adaptive weights into cross-modal attention, which lets the model adjust textual and visual attention strengths according to semantic importance and capture implicit cross-modal associations more faithfully. Evaluated on standard Twitter multimodal benchmarks, our method improves F1 score over existing state-of-the-art approaches and shows superior robustness, particularly in challenging cases where text and image are inconsistent.
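To make the mechanism concrete, here is a minimal sketch of adaptive cross-modal attention with context-aware modality weighting. All names (`AdaptiveCrossModalAttention`, the `gate` network, the mean-pooled context) are illustrative assumptions, not the authors' released code; the gate shown is one plausible realization of the context-adaptive weighting described above.

```python
# Sketch of adaptive cross-modal attention with dynamic modality weighting.
# Module and parameter names are hypothetical, not the paper's implementation.
import torch
import torch.nn as nn


class AdaptiveCrossModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Text tokens attend to image regions, and vice versa.
        self.text_to_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gating network: predicts per-example modality weights from the
        # pooled context, so attention strength adapts to semantic importance.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # text: (B, L_t, dim) token features; image: (B, L_v, dim) region features.
        t2i, _ = self.text_to_image(text, image, image)  # text queries image
        i2t, _ = self.image_to_text(image, text, text)   # image queries text
        # Context-adaptive weights from pooled representations of both modalities.
        context = torch.cat([text.mean(dim=1), image.mean(dim=1)], dim=-1)
        w = self.gate(context)  # (B, 2): [w_text, w_image]
        # Scale each cross-attended stream before fusing with a residual path.
        text_out = text + w[:, 0, None, None] * t2i
        image_out = image + w[:, 1, None, None] * i2t
        return text_out, image_out
```

In the full model, the fused text and image features would feed the aspect term extraction and sentiment classification heads; the learned gate is what allows attention strength to shift toward whichever modality carries more signal for a given example.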
📝 Abstract
We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention to improve sentiment classification and aspect term extraction from text and images. The model combines dynamic modality weighting with context-adaptive attention, sharpening the extraction of sentiment- and aspect-related information by focusing on how textual cues and visual context interact. We evaluated our approach against several baselines, including traditional text-based models and other multimodal methods. Results on standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and that it is particularly effective at identifying the nuanced inter-modal relationships crucial for accurate sentiment and aspect term extraction. This gain comes from the model's ability to adjust its focus dynamically according to contextual relevance, improving the depth and accuracy of sentiment analysis across diverse multimodal datasets. AdaptiSent thus sets a new standard for MABSA, especially in understanding complex multimodal information.