Teach Multimodal Recommendation Model to See via Personalized Visual Extraction and Adaptive Learning

📅 2026-06-08


🤖 AI Summary

Existing multimodal sequential recommendation methods often underutilize visual features, so models are dominated by textual signals and fail to capture the visual cues relevant to user preferences. To address this limitation, this work proposes REVEAL, a plug-and-play framework that enhances visual representation learning without modifying the backbone recommendation model. REVEAL combines Feedback-Guided Visual Extraction (FVE), which refines prompt-guided visual extraction using task-level feedback, with Adaptive Visual Learning (AVL), which dynamically reweights visual learning to counter modality imbalance. Together, the two components increase the contribution of visual information to the recommendation process. Experiments show that REVEAL consistently improves recommendation performance across multiple real-world datasets and diverse backbone architectures, and that it attends more effectively to visual regions aligned with user preferences.

📝 Abstract

Multimodal sequential recommendation (MSR) incorporates textual and visual information to improve recommendation quality. However, recent studies and our empirical analysis show that visual features are often underutilized and contribute far less than textual signals. We attribute this issue to two factors: insufficient visual representation learning (pretrained encoders fail to capture preference-relevant cues) and unbalanced visual-text optimization (textual features dominate the learning process). To address these issues, we propose REVEAL (Teach Multimodal Recommendation Model to See via Personalized Visual Extraction and Adaptive Learning), a plug-and-play framework that enhances visual representation learning and cross-modal optimization without modifying the original recommendation backbone. REVEAL consists of Feedback-Guided Visual Extraction (FVE), which refines prompt-guided visual extraction through task-level feedback, and Adaptive Visual Learning (AVL), which dynamically reweights visual learning to alleviate modality imbalance. Experiments on multiple real-world datasets and MSR backbones demonstrate that REVEAL consistently improves recommendation performance. Further analysis shows that these gains arise from more effective attention to preference-relevant visual regions and better visual utilization during training. The code is available at https://github.com/YutongLi2024/REVEAL.
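
To make the idea of dynamically reweighting visual learning concrete, the following is a minimal sketch of one generic way to boost an underused modality. It is not REVEAL's actual AVL formulation, which the abstract does not specify. The function names, the `text_score` and `visual_score` contribution measures, and the tanh-based rule are all assumptions chosen for illustration.

```python
import math


def visual_loss_weight(text_score: float, visual_score: float,
                       alpha: float = 1.0, eps: float = 1e-8) -> float:
    """Return a multiplier (>= 1) for the visual loss term.

    text_score and visual_score are hypothetical per-modality contribution
    measures (for example, a gradient norm). They are assumptions of this
    sketch, not quantities defined in the abstract.
    """
    ratio = text_score / (visual_score + eps)
    # Boost the visual term only when text dominates (ratio > 1).
    # tanh keeps the multiplier bounded between 1 and 2.
    return 1.0 + math.tanh(alpha * max(ratio - 1.0, 0.0))


def combined_loss(text_loss: float, visual_loss: float,
                  text_score: float, visual_score: float) -> float:
    """Sum the two modality losses, upweighting the visual one if needed."""
    weight = visual_loss_weight(text_score, visual_score)
    return text_loss + weight * visual_loss
```

Under this heuristic the visual term is left untouched when the two modalities contribute equally or when vision already dominates, and its weight grows smoothly toward 2 as the textual signal becomes more dominant.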

Problem

Research questions and friction points this paper is trying to address.

multimodal sequential recommendation

visual underutilization

visual representation learning

modality imbalance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Sequential Recommendation

Visual Representation Learning

Modality Imbalance

Prompt-Guided Extraction