Multimodal Recommendation via Self-Corrective Preference Alignment

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In live-stream recommendation, misalignment between users’ dynamic preferences and streamers’ multimodal features (e.g., visual, audio, textual) leads to low accuracy and poor interpretability. To address this, we propose the Multimodal Self-Corrective Preference Alignment (MSPA) framework. First, a multimodal large language model (MLLM) encodes user tipping behavior into structured preference texts. Second, a self-corrective alignment mechanism dynamically fuses and iteratively refines the matching between user preferences and streamer multimodal representations. Unlike conventional methods that rely on single-modal features or shallow multimodal concatenation, MSPA achieves semantic-level cross-modal alignment. Experiments on a real-world live-streaming dataset demonstrate that MSPA significantly outperforms state-of-the-art baselines in Recall@10, NDCG@10, and preference-text generation metrics (BLEU-4 and BERTScore), achieving both high recommendation accuracy and strong interpretability.
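For readers unfamiliar with the ranking metrics named above, the following is a minimal sketch of Recall@K and NDCG@K under binary relevance. It is not taken from the paper: the function names `recall_at_k` and `ndcg_at_k` are illustrative, and the sketch assumes Recall@K is normalized by the total number of relevant items (some evaluations divide by `min(len(relevant), k)` instead).

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    # Each hit at 0-based position `pos` contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Usage: a ranked list of streamer ids and the set the user actually tipped.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
recall = recall_at_k(ranked, relevant, k=2)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

Recall@K only checks whether relevant items reach the top K, while NDCG@K also rewards placing them nearer the top, which is why the two are often reported together.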

📝 Abstract

With the rapid growth of live streaming platforms, personalized recommendation systems have become pivotal in improving user experience and driving platform revenue. The dynamic and multimodal nature of live streaming content (e.g., visual, audio, textual data) requires joint modeling of user behavior and multimodal features to capture evolving author characteristics. However, traditional methods relying on single-modal features or treating multimodal ones as supplementary struggle to align users' dynamic preferences with authors' multimodal attributes, limiting accuracy and interpretability. To address this, we propose MSPA (Multimodal Self-Corrective Preference Alignment), a personalized author recommendation framework with two components: (1) a Multimodal Preference Composer that uses MLLMs to generate structured preference text and embeddings from users' tipping history; and (2) a Self-Corrective Preference Alignment Recommender that aligns these preferences with authors' multimodal features to improve accuracy and interpretability. Extensive experiments and visualizations show that MSPA significantly improves accuracy, recall, and text quality, outperforming baselines in dynamic live streaming scenarios.

Problem

Research questions and friction points this paper is trying to address.

Aligning user preferences with authors' multimodal features

Improving recommendation accuracy in dynamic streaming scenarios

Addressing limitations of single-modal and supplementary multimodal methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLMs to generate structured preference embeddings

Self-corrective alignment of multimodal author features

Combines visual, audio, textual data for recommendation
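As a rough illustration of the general idea of matching a user preference embedding against combined visual, audio, and textual author features, here is a minimal sketch. It does not reproduce MSPA: the averaging fusion, the cosine scoring, and every name (`fuse_modalities`, `rank_authors`, the toy vectors) are assumptions made for illustration, and the self-corrective step is not modeled because the summary does not specify it.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_modalities(visual, audio, textual):
    """Hypothetical late fusion: average the L2-normalized modality vectors."""
    parts = [l2_normalize(v) for v in (visual, audio, textual)]
    return [sum(vals) / len(parts) for vals in zip(*parts)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_authors(user_pref, authors):
    """Return author ids sorted by descending similarity to the user preference."""
    scores = {
        author_id: cosine(user_pref, fuse_modalities(*feats))
        for author_id, feats in authors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Usage with toy 2-d embeddings (all values invented for the example).
authors = {
    "x": ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    "y": ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
}
ranking = rank_authors([1.0, 0.1], authors)
```

Averaging treats the three modalities as equally informative; a learned fusion would weight them, but the simple version keeps the scoring step easy to follow.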

🔎 Similar Papers

Improving Multi-modal Recommender Systems by Denoising and Aligning Multi-modal Content and User Feedback