🤖 AI Summary
In live-stream recommendation, misalignment between users’ dynamic preferences and streamers’ multimodal features (e.g., visual, audio, textual) leads to low accuracy and poor interpretability. To address this, we propose the Multimodal Self-Correcting Preference Alignment (MSPA) framework. First, a multimodal large language model (MLLM) encodes user tipping behavior into structured preference texts. Second, a self-correcting alignment mechanism dynamically fuses and iteratively refines the matching between user preferences and streamer multimodal representations. Unlike conventional methods relying on single-modal features or shallow multimodal concatenation, MSPA achieves semantic-level cross-modal alignment. Experiments on a real-world live-streaming dataset demonstrate that MSPA significantly outperforms state-of-the-art baselines in Recall@10, NDCG@10, and preference-text generation metrics (BLEU-4 and BERTScore), achieving both high recommendation accuracy and strong interpretability.
📝 Abstract
With the rapid growth of live streaming platforms, personalized recommendation systems have become pivotal in improving user experience and driving platform revenue. The dynamic and multimodal nature of live streaming content (e.g., visual, audio, textual data) requires joint modeling of user behavior and multimodal features to capture evolving author characteristics. However, traditional methods relying on single-modal features or treating multimodal ones as supplementary struggle to align users' dynamic preferences with authors' multimodal attributes, limiting accuracy and interpretability. To address this, we propose MSPA (Multimodal Self-Corrective Preference Alignment), a personalized author recommendation framework with two components: (1) a Multimodal Preference Composer that uses MLLMs to generate structured preference text and embeddings from users' tipping history; and (2) a Self-Corrective Preference Alignment Recommender that aligns these preferences with authors' multimodal features to improve accuracy and interpretability. Extensive experiments and visualizations show that MSPA significantly improves accuracy, recall, and text quality, outperforming baselines in dynamic live streaming scenarios.