Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Short-video recommendation faces three key challenges: cold start, positional bias, and duration bias; moreover, immersive feed interfaces exacerbate data skew through feedback loops. To address these, we propose an unsupervised multimodal retrieval framework: it employs a fine-tuned vision-language model (VLM) to generate semantic video embeddings—replacing supervised feature learning—and incorporates a fine-grained cross-modal alignment mechanism to enhance embedding fidelity. A lightweight vector retrieval system enables efficient candidate recall. By eliminating reliance on labeled data and mitigating bias amplification, our approach effectively breaks feedback loops. Online A/B tests on a major e-commerce platform demonstrate significant improvements: +23.6% exposure rate for new videos and +17.4% relative increase in average watch time—substantially outperforming supervised learning baselines.
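The candidate-recall step described above — embedding each video once with a vision-language model, then retrieving nearest neighbors by cosine similarity — can be sketched as follows. This is a minimal illustration using random vectors in place of real VLM embeddings; the function names and the 4-dimensional toy embedding space are assumptions for the example, not details from the paper:

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize catalog embeddings so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def recall_candidates(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return the top-k catalog indices and their cosine scores for a query embedding."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy catalog: 5 "video" embeddings in a 4-d space (stand-ins for VLM outputs).
rng = np.random.default_rng(0)
catalog = rng.normal(size=(5, 4))
index = build_index(catalog)

# Querying with video 0's own embedding should rank video 0 first (score ~1.0).
ids, scores = recall_candidates(index, catalog[0], k=3)
```

Because the index is content-based, a brand-new video is retrievable as soon as its embedding is computed — no interaction history is required, which is what sidesteps the cold-start and feedback-loop issues. In production, the brute-force dot product would typically be replaced by an approximate nearest-neighbor index.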

📝 Abstract
In recent years, social media users have spent significant amounts of time on short-form video platforms. As a result, established platforms in other domains, such as e-commerce, have begun introducing short-form video content to engage users and increase their time spent on the platform. The success of these experiences is due not only to the content itself but also to a unique UI innovation: instead of offering users a list of choices to click, platforms actively recommend content for users to watch one at a time. This creates new challenges for recommender systems, especially when launching a new video experience. Beyond the limited interaction data, immersive feed experiences introduce stronger position bias due to the UI and duration bias when optimizing for watch-time, as models tend to favor shorter videos. These issues, together with the feedback loop inherent in recommender systems, make it difficult to build effective solutions. In this paper, we highlight the challenges faced when introducing a new short-form video experience and present our experience showing that, even with sufficient video interaction data, it can be more beneficial to leverage a video retrieval system using a fine-tuned multimodal vision-language model to overcome these challenges. This approach demonstrated greater effectiveness compared to conventional supervised learning methods in online experiments conducted on our e-commerce platform.
Problem

Research questions and friction points this paper is trying to address.

Overcoming cold-start challenges in short-form video recommendations
Addressing position and duration biases in immersive feed experiences
Improving recommendations using multimodal embeddings over supervised learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal embeddings for video retrieval
Fine-tuned vision-language model
Overcoming cold-start and bias challenges
Andrii Dzhoha
Zalando SE, Berlin, Germany
Katya Mirylenka
Zalando Switzerland AG, Zürich, Switzerland
Egor Malykh
Zalando SE, Berlin, Germany
Marco-Andrea Buchmann
Zalando Switzerland AG, Zürich, Switzerland
Francesca Catino
Zalando Switzerland AG, Zürich, Switzerland