🤖 AI Summary
To address the challenge of insufficient user preference modeling in conversational recommender systems (CRS) caused by short and sparse dialogue contexts, this paper proposes a multimodal semantic modeling approach that jointly leverages textual and visual modalities. Specifically, we construct modality-specific semantic graphs and—novelly—integrate multimodal graph-structured modeling with large language model (LLM) prompt learning. User preference representation is enhanced through high-order collaborative, textual, and visual modality associations. Our method comprises four components: multimodal feature extraction, modality-specific graph neural networks, cross-modal graph alignment and fusion, and prompt-based fine-tuning. Extensive experiments demonstrate significant improvements across multiple benchmarks: Recall@10 increases by 12.6%, BLEU-4 by 9.3%, and BERTScore by 8.1%. To foster reproducibility and further research, we publicly release both our source code and an extended multimodal CRS benchmark dataset.
📝 Abstract
Conversational Recommender Systems (CRSs) aim to provide personalized recommendations by interacting with users through conversations. Most existing studies of CRS focus on extracting user preferences from conversational contexts. However, due to the short and sparse nature of conversational contexts, it is difficult to fully capture user preferences by conversational contexts only. We argue that multi-modal semantic information can enrich user preference expressions from diverse dimensions (e.g., a user preference for a certain movie may stem from its magnificent visual effects and compelling storyline). In this paper, we propose a multi-modal semantic graph prompt learning framework for CRS, named MSCRS. First, we extract textual and image features of items mentioned in the conversational contexts. Second, we capture higher-order semantic associations within different semantic modalities (collaborative, textual, and image) by constructing modality-specific graph structures. Finally, we propose an innovative integration of multi-modal semantic graphs with prompt learning, harnessing the power of large language models to comprehensively explore high-dimensional semantic relationships. Experimental results demonstrate that our proposed method significantly improves accuracy in item recommendation, as well as generates more natural and contextually relevant content in response generation. We have released the code and the expanded multi-modal CRS datasets to facilitate further exploration in related researchfootnote{https://github.com/BIAOBIAO12138/MSCRS-main}.