🤖 AI Summary
Short-video recommendation faces three key challenges: cold start, positional bias, and duration bias; moreover, immersive feed interfaces exacerbate data skew through feedback loops. To address these, we propose an unsupervised multimodal retrieval framework: it employs a fine-tuned vision-language model (VLM) to generate semantic video embeddings in place of supervised feature learning, and incorporates a fine-grained cross-modal alignment mechanism to improve embedding fidelity. A lightweight vector retrieval system enables efficient candidate recall. By removing the reliance on labeled interaction data and mitigating bias amplification, the approach breaks the recommender feedback loop. Online A/B tests on a major e-commerce platform show significant gains: a +23.6% exposure rate for new videos and a +17.4% relative increase in average watch time, substantially outperforming supervised learning baselines.
📝 Abstract
In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences stems not only from the content itself but also from a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one video at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI, as well as duration bias when optimizing for watch time, since models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges of introducing a new short-form video experience and share our finding that, even with sufficient video interaction data, it can be more beneficial to rely on a video retrieval system built on a fine-tuned multimodal vision-language model. In online experiments on our e-commerce platform, this approach proved more effective than conventional supervised learning methods.
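To make the retrieval idea concrete, the sketch below shows the candidate-recall step the abstract describes: videos are represented by embeddings (here random stand-ins for the VLM outputs) and candidates are recalled by cosine similarity from a small in-memory index. All names, shapes, and the use of a brute-force NumPy index are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_index(video_embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize embeddings so a dot product equals cosine similarity."""
    norms = np.linalg.norm(video_embeddings, axis=1, keepdims=True)
    return video_embeddings / np.clip(norms, 1e-12, None)

def recall_candidates(index: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k videos most similar to the query embedding."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q                       # cosine similarity against all videos
    return np.argsort(-scores)[:k]           # highest-scoring k candidates

# Toy catalog: 1000 videos with 64-dim embeddings (stand-ins for VLM outputs).
rng = np.random.default_rng(0)
catalog = build_index(rng.normal(size=(1000, 64)))
candidates = recall_candidates(catalog, catalog[42], k=5)
print(candidates)  # the query video itself (index 42) ranks first
```

In production such a brute-force scan would typically be replaced by an approximate nearest-neighbor index (e.g. a library like Faiss), but the normalized-embedding cosine recall shown here is the core operation either way.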