Multimodal Foundation Model-Driven User Interest Modeling and Behavior Analysis on Short Video Platforms

πŸ“… 2025-09-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the limitation of unimodal interest modeling in capturing users’ complex and evolving preferences on short-video platforms, this paper proposes a dynamic multimodal interest modeling framework grounded in foundation models. The framework jointly encodes video frames, textual descriptions, and background audio, aligning them into a unified semantic space via cross-modal alignment; it further introduces a behavior-driven dynamic feature embedding mechanism to enable fine-grained interest representation and temporal evolution tracking. Additionally, an interpretable attention mechanism coupled with feature visualization enhances cold-start user modeling and improves recommendation transparency. Extensive experiments on both public and proprietary datasets demonstrate significant improvements: +8.2% in click-through rate, +6.7% in behavioral prediction accuracy, and notably stronger performance and interpretability for cold-start recommendations.
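The cross-modal alignment step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimensions, random projection matrices, and mean-fusion rule are all assumptions chosen for clarity (a trained model would learn the projections, e.g. with a contrastive objective).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embedding sizes (not taken from the paper).
D_VIDEO, D_TEXT, D_AUDIO, D_SHARED = 512, 384, 128, 256

# One projection matrix per modality, mapping into the shared semantic
# space; randomly initialized here purely for illustration.
W = {
    "video": rng.normal(0, 0.02, (D_VIDEO, D_SHARED)),
    "text":  rng.normal(0, 0.02, (D_TEXT, D_SHARED)),
    "audio": rng.normal(0, 0.02, (D_AUDIO, D_SHARED)),
}

def align(features: dict) -> np.ndarray:
    """Project each modality into the shared space, L2-normalize each
    projection, and fuse by averaging into one item-level vector."""
    projected = []
    for name, x in features.items():
        z = x @ W[name]
        projected.append(z / (np.linalg.norm(z) + 1e-8))
    return np.mean(projected, axis=0)

item_vec = align({
    "video": rng.normal(size=D_VIDEO),   # frame encoder output
    "text":  rng.normal(size=D_TEXT),    # description encoder output
    "audio": rng.normal(size=D_AUDIO),   # background-audio encoder output
})
print(item_vec.shape)  # (256,)
```

In a real system the three inputs would come from pretrained foundation-model encoders; the point here is only the shape of the pipeline: modality-specific features in, one unified semantic vector out.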

πŸ“ Abstract
With the rapid expansion of user bases on short video platforms, personalized recommendation systems are playing an increasingly critical role in enhancing user experience and optimizing content distribution. Traditional interest modeling methods often rely on unimodal data, such as click logs or text labels, which limits their ability to fully capture user preferences in a complex multimodal content environment. To address this challenge, this paper proposes a multimodal foundation model-based framework for user interest modeling and behavior analysis. By integrating video frames, textual descriptions, and background music into a unified semantic space using cross-modal alignment strategies, the framework constructs fine-grained user interest vectors. Additionally, we introduce a behavior-driven feature embedding mechanism that incorporates viewing, liking, and commenting sequences to model dynamic interest evolution, thereby improving both the timeliness and accuracy of recommendations. In the experimental phase, we conduct extensive evaluations using both public and proprietary short video datasets, comparing our approach against multiple mainstream recommendation algorithms and modeling techniques. Results demonstrate significant improvements in behavior prediction accuracy, interest modeling for cold-start users, and recommendation click-through rates. Moreover, we incorporate interpretability mechanisms using attention weights and feature visualization to reveal the model's decision basis under multimodal inputs and trace interest shifts, thereby enhancing the transparency and controllability of the recommendation system.
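The behavior-driven embedding mechanism in the abstract (viewing, liking, and commenting sequences driving a dynamic interest vector) can be sketched as a recency- and behavior-weighted aggregation. The specific weights and decay factor below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical behavior-type weights (assumption): stronger signals
# (comment > like > view) contribute more to the interest vector.
BEHAVIOR_WEIGHT = {"view": 1.0, "like": 2.0, "comment": 3.0}
DECAY = 0.9  # exponential recency decay per step (assumption)

def interest_vector(events):
    """events: list of (behavior, item_vec), ordered oldest -> newest.
    Returns a recency- and behavior-weighted user interest embedding,
    so recent and high-engagement interactions dominate."""
    n = len(events)
    acc, total = 0.0, 0.0
    for i, (behavior, vec) in enumerate(events):
        w = BEHAVIOR_WEIGHT[behavior] * DECAY ** (n - 1 - i)
        acc = acc + w * vec
        total += w
    return acc / total

rng = np.random.default_rng(1)
events = [("view", rng.normal(size=8)),
          ("like", rng.normal(size=8)),
          ("comment", rng.normal(size=8))]
u = interest_vector(events)
print(u.shape)  # (8,)
```

Recomputing this vector as new interactions arrive is what lets the representation track interest evolution over time; the paper's framework presumably learns these weightings rather than fixing them by hand.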
Problem

Research questions and friction points this paper is trying to address.

Modeling user interests in short video platforms using multimodal data
Improving recommendation accuracy by capturing dynamic interest evolution
Enhancing interpretability and transparency of recommendation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal foundation model integrates video, text, audio
Behavior-driven embedding captures dynamic interest evolution
Cross-modal alignment creates unified semantic user vectors
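The interpretability idea above (attention weights revealing which modality drove a recommendation) can be sketched as scaled dot-product attention of a user vector over per-modality item vectors. The dimensions and scoring rule are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def explain(user_vec, modality_vecs):
    """Score the user vector against each modality's item vector and
    normalize with softmax; the resulting weights indicate which
    modality (video, text, or audio) drove the match."""
    scores = np.array([user_vec @ v for v in modality_vecs.values()])
    weights = softmax(scores / np.sqrt(len(user_vec)))
    return dict(zip(modality_vecs, weights))

rng = np.random.default_rng(2)
user = rng.normal(size=16)
mods = {"video": rng.normal(size=16),
        "text":  rng.normal(size=16),
        "audio": rng.normal(size=16)}
w = explain(user, mods)
print({k: round(float(v), 3) for k, v in w.items()})
# weights sum to 1; the largest weight names the dominant modality
```

Exposing these weights per recommendation is one simple way to obtain the transparency the abstract describes, alongside feature visualization.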
πŸ”Ž Similar Papers
No similar papers found.
Yushang Zhao
Washington University in St. Louis
Artificial Intelligence · NLP · LLM · Recommendation · Digital Marketing
Qianyi Sun
Vanderbilt University
Yike Peng
Graduate School of Arts and Sciences, Columbia University
Zhihui Zhang
Graduate School of Arts and Sciences, Boston University
Li Zhang
Amazon
Yingying Zhuang
Amazon
Statistics · Machine Learning · Causal Inference