AI Summary
To address the limitation of unimodal interest modeling in capturing users' complex and evolving preferences on short-video platforms, this paper proposes a dynamic multimodal interest modeling framework grounded in foundation models. The framework jointly encodes video frames, textual descriptions, and background audio, aligning them into a unified semantic space via cross-modal alignment; it further introduces a behavior-driven dynamic feature embedding mechanism to enable fine-grained interest representation and temporal evolution tracking. Additionally, an interpretable attention mechanism coupled with feature visualization enhances cold-start user modeling and improves recommendation transparency. Extensive experiments on both public and proprietary datasets demonstrate significant improvements: +8.2% in click-through rate, +6.7% in behavioral prediction accuracy, and notably stronger performance and interpretability for cold-start recommendations.
Abstract
With the rapid expansion of user bases on short video platforms, personalized recommendation systems are playing an increasingly critical role in enhancing user experience and optimizing content distribution. Traditional interest modeling methods often rely on unimodal data, such as click logs or text labels, which limits their ability to fully capture user preferences in a complex multimodal content environment. To address this challenge, this paper proposes a multimodal foundation model-based framework for user interest modeling and behavior analysis. By integrating video frames, textual descriptions, and background music into a unified semantic space using cross-modal alignment strategies, the framework constructs fine-grained user interest vectors. Additionally, we introduce a behavior-driven feature embedding mechanism that incorporates viewing, liking, and commenting sequences to model dynamic interest evolution, thereby improving both the timeliness and accuracy of recommendations. In the experimental phase, we conduct extensive evaluations using both public and proprietary short video datasets, comparing our approach against multiple mainstream recommendation algorithms and modeling techniques. Results demonstrate significant improvements in behavior prediction accuracy, interest modeling for cold-start users, and recommendation click-through rates. Moreover, we incorporate interpretability mechanisms using attention weights and feature visualization to reveal the model's decision basis under multimodal inputs and trace interest shifts, thereby enhancing the transparency and controllability of the recommendation system.
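As a rough illustration of the pipeline described above (not the paper's actual model), the sketch below fuses per-item frame, text, and audio embeddings and then aggregates a user's behavior sequence into a single interest vector. It makes several assumptions that do not come from the paper: the three modality embeddings are taken to already live in a shared space, with simple averaging standing in for learned cross-modal alignment; the behavior weights in `BEHAVIOR_WEIGHTS` and the exponential `half_life` decay are hypothetical choices used only to show how recent, stronger signals could dominate the interest vector.

```python
import math

# Hypothetical weights: stronger engagement contributes more (assumption).
BEHAVIOR_WEIGHTS = {"view": 1.0, "like": 2.0, "comment": 3.0}


def l2_normalize(vec):
    """Scale a vector to unit length; return it unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_modalities(frame_vec, text_vec, audio_vec):
    """Average unit-normalized modality embeddings of equal dimension.

    Stand-in for learned cross-modal alignment: the inputs are assumed
    to be projected into one semantic space already.
    """
    parts = [l2_normalize(v) for v in (frame_vec, text_vec, audio_vec)]
    fused = [sum(vals) / len(parts) for vals in zip(*parts)]
    return l2_normalize(fused)


def user_interest(events, now, half_life=3600.0):
    """Aggregate (timestamp, behavior, item_vec) events into one vector.

    Each event is weighted by its behavior type and by an exponential
    time decay, so recent interactions shift the vector the most.
    """
    if not events:
        return []
    acc = [0.0] * len(events[0][2])
    for ts, behavior, item_vec in events:
        decay = 0.5 ** ((now - ts) / half_life)
        weight = BEHAVIOR_WEIGHTS[behavior] * decay
        for i, x in enumerate(item_vec):
            acc[i] += weight * x
    return l2_normalize(acc)


# Toy usage with 2-dimensional embeddings.
item_a = fuse_modalities([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
item_b = fuse_modalities([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
events = [(0.0, "view", item_a), (3600.0, "like", item_b)]
interest = user_interest(events, now=3600.0)
```

In this toy run the older view of `item_a` is both down-weighted by decay and outweighed by the recent like of `item_b`, so the resulting interest vector points mostly toward `item_b`. A real system would replace the averaging step with trained encoders and the fixed weights with learned, sequence-aware components.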