Multimodal Music Recommendation System using LLMs

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of traditional music recommendation systems: they treat songs as opaque identifiers and neglect their semantic and acoustic content, which hinders content-aware personalization. We propose the first unified large language model (LLM)-driven multimodal framework for session-based music recommendation that jointly models audio and lyric embeddings, LLM-generated metadata, and user listening completion ratios within a sequential recommendation architecture, integrating acoustic, semantic, and behavioral signals. To facilitate research in this direction, we construct and publicly release a large-scale multimodal music recommendation benchmark built on LastFM-1K and evaluate several LLM backbones (LLaMA-2-13B, Qwen2.5-7B-Instruct, and LLaMA-3-70B) in both zero-shot and fine-tuned settings. Experiments show gains over ID-based baselines of up to 95% in Recall and 79% in NDCG, while also revealing that naive multimodal fusion does not necessarily yield better performance.

📝 Abstract

Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic and acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation, and while some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata following the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework and extend it with multimodal features and with different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone options to LLaMA-2-13B, Qwen2.5-7B-Instruct, and LLaMA-3-70B, in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and 79% in NDCG. They also show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
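The reported gains are relative improvements in Recall and NDCG. As a rough illustration of how such numbers are commonly computed in next-item (session-based) evaluation, here is a minimal Python sketch. It assumes a single held-out target item per session and a cutoff K; the abstract does not state the exact metric definitions or cutoffs, so the function names, item IDs, and scores below are hypothetical and not taken from the paper.

```python
import math


def recall_at_k(ranked_items, target, k):
    """Return 1.0 if the held-out target is among the top-k items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with exactly one relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG reduces to
    1 / log2(rank + 1) when the target is in the top-k, else 0.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)


def relative_gain_percent(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


# Hypothetical session: the model ranks three songs, the true next song is "song_a".
ranked = ["song_c", "song_a", "song_b"]
hit = recall_at_k(ranked, "song_a", k=2)   # target is inside the top 2
gain = ndcg_at_k(ranked, "song_a", k=2)    # target sits at rank 2

# Hypothetical averaged scores for an ID-only baseline and a content-aware model.
improvement = relative_gain_percent(new_score=0.30, baseline_score=0.20)
```

In practice both metrics are averaged over all test sessions before the relative gain is taken, so a statement like "up to 95% in Recall" compares two averaged scores rather than individual sessions.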

Problem

Research questions and friction points this paper is trying to address.

music recommendation

multimodal learning

large language models

sequential recommendation

content-based features

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal recommendation

large language models

music representation