Impatient Bandits: Optimizing for the Long-Term Without Delay

๐Ÿ“… 2025-01-14
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Optimizing long-term user satisfaction in recommender systems remains challenging, particularly when reward feedback is delayed by weeks and short-term proxy signals induce myopic decision-making. Method: We propose a Bayesian delayed reward prediction model that fuses multi-source feedback and design a novel contextual bandit algorithm for joint exploration and prediction optimization. Contribution/Results: We introduce "progressive feedback value" — an information-theoretic measure quantifying the quality of short-term signals — and incorporate it explicitly into regret analysis, relaxing the conventional bandit assumption of either immediate or fully delayed rewards. Our theoretical regret bound depends explicitly on progressive feedback quality. In an A/B test on a podcast platform serving hundreds of millions of users, the method achieves statistically significant gains in long-term replay rate, outperforming both myopic optimization and pure delayed-reward baselines.
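The core idea of the delayed reward prediction model can be illustrated with a toy sketch. This is not the paper's actual formulation: the class name, Gaussian belief, and noise model below are all illustrative assumptions. The sketch shows the general mechanism of fusing short-term surrogate observations into a probabilistic belief about the long-term reward, with earlier (more partial) observations treated as noisier and therefore less influential.

```python
import math

class DelayedRewardBelief:
    """Toy Gaussian belief over an item's long-term (e.g., 60-day) reward.

    Illustrative sketch only (not the paper's model): the surrogate
    observed after `days_seen` days is treated as a noisy estimate of
    the final reward, with observation noise shrinking as more of the
    reward window elapses.
    """

    def __init__(self, prior_mean=0.5, prior_var=1.0, horizon_days=60):
        self.mean = prior_mean
        self.var = prior_var
        self.horizon = horizon_days

    def update(self, surrogate, days_seen, base_noise=1.0):
        # Earlier, more partial observations are noisier proxies of the
        # final reward, so they move the posterior less.
        obs_var = base_noise * self.horizon / max(days_seen, 1)
        gain = self.var / (self.var + obs_var)       # Kalman-style gain
        self.mean += gain * (surrogate - self.mean)  # posterior mean
        self.var *= (1.0 - gain)                     # posterior variance
        return self.mean, self.var
```

Under this toy model, a surrogate seen after only a few days barely shifts the belief, while a fully observed reward at the horizon carries the most weight, which is the qualitative behavior the summary attributes to the fused multi-source model.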

๐Ÿ“ Abstract
Increasingly, recommender systems are tasked with improving users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a bandit problem with delayed rewards. There is an apparent trade-off in choosing the learning signal: waiting for the full reward to become available might take several weeks, slowing the rate of learning, whereas using short-term proxy rewards reflects the actual long-term goal only imperfectly. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Rewards as well as shorter-term surrogate outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that quickly learns to identify content aligned with long-term success using this new predictive model. We prove a regret bound for our algorithm that depends on the *Value of Progressive Feedback*, an information-theoretic metric that captures the quality of short-term leading indicators that are observed prior to the long-term reward. We apply our approach to a podcast recommendation problem, where we seek to recommend shows that users engage with repeatedly over two months. We empirically validate that our approach significantly outperforms methods that optimize for short-term proxies or rely solely on delayed rewards, as demonstrated by an A/B test in a recommendation system that serves hundreds of millions of users.
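The second component, the bandit algorithm that acts on the probabilistic belief, can be sketched as Thompson sampling over per-arm posteriors. This is a minimal, hedged illustration under assumed names: `beliefs` is a hypothetical mapping from arm to a Gaussian `(posterior_mean, posterior_var)` pair, not the paper's actual interface. The point it shows is that sampling from a delayed-reward posterior lets promising arms be explored before their full two-month reward is observed.

```python
import math
import random

def thompson_select(beliefs):
    """Pick the arm whose sampled long-term reward estimate is largest.

    `beliefs` maps arm -> (posterior_mean, posterior_var), e.g. the
    output of a Bayesian filter over delayed rewards. Minimal Thompson
    sampling sketch; names and the Gaussian posterior are illustrative
    assumptions, not the paper's exact algorithm.
    """
    samples = {
        arm: random.gauss(mean, math.sqrt(var))
        for arm, (mean, var) in beliefs.items()
    }
    # Arms with wide posteriors (little feedback so far) occasionally win
    # the sample, which is what drives exploration.
    return max(samples, key=samples.get)
```

As the filter narrows each arm's posterior with incoming short-term surrogates, sampling concentrates on arms whose leading indicators predict long-term success.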
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Recommendation System
Delayed Feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Delayed Reward
Hybrid Feedback
Reinforcement Learning
๐Ÿ”Ž Similar Papers
No similar papers found.