MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems

📅 2025-08-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing multimodal recommendation systems face two key bottlenecks: (1) user multimodal representation initialization is vulnerable to behavioral blind spots or noisy interactions; and (2) item graphs constructed via k-nearest neighbors (k-NN) contain low-similarity noisy edges and neglect audience co-occurrence patterns. To address these, we propose MLLMRecβ€”the first recommendation framework integrating Multimodal Large Language Models (MLLMs). MLLMRec leverages image-to-text generation to model fine-grained behavioral semantics, enabling semantic purification of user preferences. It further reconstructs a high-confidence, co-occurrence-aware item graph via joint threshold-based denoising and topology-aware enhancement. Extensive experiments on three public benchmarks demonstrate that MLLMRec significantly outperforms state-of-the-art baselines, achieving an average performance gain of 38.53%. This validates the effectiveness of MLLM-driven semantic alignment coupled with collaborative graph structural optimization.

πŸ“ Abstract
Multimodal recommendation typically combines user behavioral data with the modal features of items to reveal user preferences, achieving superior performance over conventional recommendation methods. However, existing methods still suffer from two key problems: (1) the initialization of user multimodal representations is either behavior-unaware or noise-contaminated, and (2) the KNN-based item-item graph contains noisy low-similarity edges and lacks audience co-occurrence relationships. To address these issues, we propose MLLMRec, a novel MLLM-driven multimodal recommendation framework with two item-item graph refinement strategies. On the one hand, item images are first converted into high-quality semantic descriptions using an MLLM, which are then fused with the items' textual metadata. A behavioral description list is then constructed for each user and fed into the MLLM to reason about the purified user preference, including interaction motivations. On the other hand, we design threshold-controlled denoising and topology-aware enhancement strategies to refine the suboptimal item-item graph, thereby improving item representation learning. Extensive experiments on three publicly available datasets demonstrate that MLLMRec achieves state-of-the-art performance with an average improvement of 38.53% over the best baselines.
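The two graph-refinement ideas from the abstract, dropping low-similarity edges from a k-NN item-item graph and adding edges for items that share audiences, can be sketched as follows. This is a minimal illustration under assumed choices (cosine similarity, dense matrices, an illustrative threshold); the function names and parameters are not from the paper.

```python
import numpy as np

def build_knn_graph(feat, k=10):
    """Cosine-similarity k-NN item-item graph (dense, for illustration)."""
    f = feat / np.linalg.norm(feat, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-loops from top-k
    adj = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]   # top-k most similar items per row
    for i, nbrs in enumerate(idx):
        adj[i, nbrs] = sim[i, nbrs]
    return adj

def denoise(adj, tau=0.5):
    """Threshold-controlled denoising: zero out low-similarity edges."""
    return np.where(adj >= tau, adj, 0.0)

def add_cooccurrence(adj, interactions, min_co=2, w=1.0):
    """Topology-aware enhancement (sketch): connect items whose audiences
    overlap in at least `min_co` users. `interactions` is a binary
    user-item matrix."""
    co = interactions.T @ interactions      # item-item co-occurrence counts
    np.fill_diagonal(co, 0)
    return adj + w * (co >= min_co).astype(adj.dtype)
```

The threshold `tau`, the co-occurrence cutoff `min_co`, and the edge weight `w` stand in for whatever tuned values the paper uses; a real implementation would also row-normalize the refined graph before message passing.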
Problem

Research questions and friction points this paper is trying to address.

Refining user multimodal representations to remove noise and incorporate behavior
Enhancing item-item graphs by eliminating low-similarity edges and adding co-occurrence relationships
Improving multimodal recommendation accuracy through MLLM-driven framework and graph refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLM to generate semantic item descriptions
Refines item graphs with denoising and enhancement strategies
Purifies user preferences through behavioral description analysis
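The last point, reasoning about purified user preferences from a behavioral description list, amounts to assembling per-item descriptions into a prompt for the MLLM. The paper's actual prompt template is not reproduced here, so the format below is purely illustrative:

```python
def build_preference_prompt(user_history, max_items=10):
    """Turn a user's behavioral description list into a preference-reasoning
    prompt (hypothetical template; only the idea matches the paper)."""
    recent = user_history[-max_items:]  # keep only the most recent items
    lines = [f"{i + 1}. {title}: {desc}" for i, (title, desc) in enumerate(recent)]
    return (
        "The user interacted with the following items:\n"
        + "\n".join(lines)
        + "\nSummarize the user's underlying preferences and likely "
          "interaction motivations, ignoring items that look like noise."
    )
```

The returned string would be sent to an MLLM, and its response used as the purified textual preference representation.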
🔎 Similar Papers
2024-08-08 · International Workshop on Semantic and Social Media Adaptation and Personalization · Citations: 13
Yuzhuo Dang
National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha, China
Xin Zhang
National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha, China
Zhiqiang Pan
National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha, China
Yuxiao Duan
Laboratory for Big Data and Decision, National University of Defense Technology, Changsha, China
Wanyu Chen
National University of Defense Technology
Fei Cai
National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha, China
Honghui Chen