🤖 AI Summary
This work addresses the limitations of existing watch-time prediction methods, which suffer from unimodal assumptions, discretization errors, or high inference latency, and which fail to capture the multimodal and heterogeneous nature of user-item interactions. From a causal perspective, the authors treat user-specific interaction patterns as structural confounders and introduce a new paradigm, continuous generative regression. Their model, FlowTime, uses a one-step generative variational autoencoder to avoid iterative denoising while still modeling a continuous latent space, and pairs it with a personalized prior in which normalizing flows warp a standard Gaussian into a complex, history-conditioned manifold. Contributions include the FlowTime model, a personalization-oriented evaluation metric, and TimeRec, described as the first open-source library for watch-time prediction. Offline experiments and online A/B tests are reported to show that FlowTime significantly outperforms state-of-the-art methods in prediction accuracy and inference efficiency.
📝 Abstract
Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression suffers from mean collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes: identical interests manifest as distinct watch time outcomes depending on diverse user habits. Then, we formally propose a fourth paradigm, Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages normalizing flows (NFs) to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over state-of-the-art (SOTA) methods.
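
To make the idea of a flow-based, history-conditioned prior more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a RealNVP-style affine coupling layer whose scale and shift depend on a user-history embedding `h`; the class `ConditionalCoupling`, the functions `sample_prior`, `prior_log_prob`, and `decode_once`, the linear conditioners, and all dimensions are hypothetical choices made for illustration. The abstract does not specify the flow architecture or the decoder.

```python
import numpy as np


class ConditionalCoupling:
    """Affine coupling layer conditioned on a history embedding h (illustrative)."""

    def __init__(self, latent_dim, hist_dim, flip, rng):
        self.half = latent_dim // 2
        self.flip = flip  # alternate which half is transformed between layers
        in_dim = self.half + hist_dim
        out_dim = latent_dim - self.half
        # Hypothetical linear conditioners; a real model would use learned networks.
        self.W_s = rng.normal(scale=0.5, size=(in_dim, out_dim))
        self.W_t = rng.normal(scale=0.5, size=(in_dim, out_dim))

    def _split(self, x):
        if self.flip:
            x = x[:, ::-1]
        return x[:, :self.half], x[:, self.half:]

    def _merge(self, a, b):
        x = np.concatenate([a, b], axis=1)
        return x[:, ::-1] if self.flip else x

    def _scale_shift(self, a, h):
        inp = np.concatenate([a, h], axis=1)
        # tanh bounds the log-scale so exp() stays numerically stable
        return np.tanh(inp @ self.W_s), inp @ self.W_t

    def forward(self, x, h):
        a, b = self._split(x)
        s, t = self._scale_shift(a, h)
        return self._merge(a, b * np.exp(s) + t), s.sum(axis=1)

    def inverse(self, y, h):
        a, b = self._split(y)
        s, t = self._scale_shift(a, h)
        return self._merge(a, (b - t) * np.exp(-s)), -s.sum(axis=1)


def sample_prior(layers, h, latent_dim, rng):
    """Draw eps ~ N(0, I) and warp it into a history-conditioned latent z."""
    z = rng.standard_normal((h.shape[0], latent_dim))
    for layer in layers:
        z, _ = layer.forward(z, h)
    return z


def prior_log_prob(layers, z, h):
    """log p(z | h) via the change-of-variables formula."""
    log_det = np.zeros(z.shape[0])
    for layer in reversed(layers):
        z, ld = layer.inverse(z, h)
        log_det += ld
    d = z.shape[1]
    log_base = -0.5 * (z ** 2).sum(axis=1) - 0.5 * d * np.log(2 * np.pi)
    return log_base + log_det


def decode_once(z, w, b):
    """Hypothetical one-step decoder: a single pass from z to a non-negative watch time."""
    return np.logaddexp(0.0, z @ w + b)  # softplus keeps the prediction >= 0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent_dim, hist_dim = 4, 3
    layers = [ConditionalCoupling(latent_dim, hist_dim, flip, rng) for flip in (False, True)]
    h = rng.standard_normal((5, hist_dim))  # stand-in for encoded user histories
    z = sample_prior(layers, h, latent_dim, rng)
    watch_time = decode_once(z, rng.standard_normal(latent_dim), 0.0)
    print(watch_time.shape, prior_log_prob(layers, z, h).shape)
```

Because each layer is invertible with a cheap log-determinant, the same parameters support both sampling and exact density evaluation. Prediction in this sketch needs one forward pass through the flow and one through the decoder, which is the sense in which a one-step generator avoids iterative denoising.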