Multimodal Recommendation via Self-Corrective Preference Alignment

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
In live-stream recommendation, misalignment between users’ dynamic preferences and streamers’ multimodal features (e.g., visual, audio, textual) leads to low accuracy and poor interpretability. To address this, we propose the Multimodal Self-Corrective Preference Alignment (MSPA) framework. First, a multimodal large language model (MLLM) encodes user tipping behavior into structured preference texts. Second, a self-corrective alignment mechanism dynamically fuses and iteratively refines the matching between user preferences and streamer multimodal representations. Unlike conventional methods that rely on single-modal features or shallow multimodal concatenation, MSPA achieves semantic-level cross-modal alignment. Experiments on a real-world live-streaming dataset demonstrate that MSPA significantly outperforms state-of-the-art baselines in Recall@10, NDCG@10, and preference-text generation metrics (BLEU-4 and BERTScore), achieving both high recommendation accuracy and strong interpretability.
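
For readers unfamiliar with the ranking metrics named above, the following is a minimal sketch of how Recall@K and NDCG@K are commonly computed for a single user. It assumes binary relevance and divides recall by the total number of relevant items; the paper may use a different variant, and the function names and sample data here are illustrative only.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    # Each hit at 0-based position `pos` contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: recommended authors and the ones the user actually tipped.
ranked = ["a3", "a7", "a1", "a9"]
relevant = {"a7", "a9"}
print(recall_at_k(ranked, relevant, k=10))
print(ndcg_at_k(ranked, relevant, k=10))
```

In practice these per-user values are averaged over all test users to obtain the reported figure.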

📝 Abstract
With the rapid growth of live streaming platforms, personalized recommendation systems have become pivotal in improving user experience and driving platform revenue. The dynamic and multimodal nature of live streaming content (e.g., visual, audio, textual data) requires joint modeling of user behavior and multimodal features to capture evolving author characteristics. However, traditional methods relying on single-modal features or treating multimodal ones as supplementary struggle to align users' dynamic preferences with authors' multimodal attributes, limiting accuracy and interpretability. To address this, we propose MSPA (Multimodal Self-Corrective Preference Alignment), a personalized author recommendation framework with two components: (1) a Multimodal Preference Composer that uses MLLMs to generate structured preference text and embeddings from users' tipping history; and (2) a Self-Corrective Preference Alignment Recommender that aligns these preferences with authors' multimodal features to improve accuracy and interpretability. Extensive experiments and visualizations show that MSPA significantly improves accuracy, recall, and text quality, outperforming baselines in dynamic live streaming scenarios.
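
The general idea of matching a user preference embedding against authors' multimodal features can be sketched as below. This is not the MSPA method: it assumes that all embeddings share one dimension, that modalities are fused by simple averaging, and that matching is done by cosine similarity, and it omits the self-corrective step entirely. All names and vectors are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_modalities(visual, audio, text):
    """Assumed fusion: average the L2-normalized modality embeddings."""
    parts = [l2_normalize(v) for v in (visual, audio, text)]
    return [sum(vals) / len(parts) for vals in zip(*parts)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_authors(user_pref, authors):
    """Order authors by similarity between the user preference and fused features."""
    scores = {
        name: cosine(user_pref, fuse_modalities(*feats))
        for name, feats in authors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: a 3-dimensional preference embedding and per-author
# (visual, audio, text) embeddings.
user_pref = [1.0, 0.0, 0.0]
authors = {
    "author_a": ([0.9, 0.1, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]),
    "author_b": ([0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]),
}
ranking = rank_authors(user_pref, authors)
print(ranking)
```

Averaging treats every modality equally; a learned or attention-based fusion would weight them differently, which is the kind of refinement the abstract attributes to the alignment recommender.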
Problem

Research questions and friction points this paper is trying to address.

Aligning user preferences with authors' multimodal features
Improving recommendation accuracy in dynamic streaming scenarios
Addressing limitations of single-modal and supplementary multimodal methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLMs to generate structured preference embeddings
Self-corrective alignment of multimodal author features
Combines visual, audio, textual data for recommendation
Authors

Yalong Guan
Kuaishou Technology, Beijing, China

Xiang Chen
Kuaishou Technology, Beijing, China

Mingyang Wang
University of Munich (LMU Munich)
Natural Language Processing

Xiangyu Wu
Kuaishou Technology, Beijing, China

Lihao Liu
Amazon
LLM-based Agent, Healthcare AI

Chao Qi
Kuaishou Technology, Beijing, China

Shuang Yang
Kuaishou Technology, Beijing, China

Tingting Gao
Kuaishou Technology, Beijing, China

Guorui Zhou
Unknown affiliation
Recommender System, Advertising, Artificial Intelligence, Machine Learning, NLP

Changjian Chen
Associate Professor, Hunan University
Interactive Machine Learning, Data-Centric AI