Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper exposes a severe exploration-exploitation bias in offline evaluation of multi-armed bandit (MAB) algorithms for recommender systems: existing protocols systematically favor exploitation and underestimate the value of exploration. Through a large-scale empirical study across 90+ real-world datasets—rigorously adhering to standard offline evaluation protocols within the contextual linear bandit framework and incorporating thorough hyperparameter optimization—we find that a pure greedy (no-exploration) model outperforms or matches state-of-the-art exploration algorithms on over 90% of datasets. This challenges the foundational assumption that offline evaluation reliably validates exploration strategies, revealing a structural flaw in the current paradigm. The core contribution is the first quantitative demonstration of systematic underestimation of exploration efficacy in offline evaluation. The work calls for a new evaluation paradigm capable of capturing the long-term value of exploration, moving beyond static, myopic reward estimation.

📝 Abstract
Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration-exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of the datasets studied, a greedy linear model, with no exploration of any kind, consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.
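The "shared linear regression backbone" the abstract describes can be made concrete. The sketch below (not from the paper; it assumes the standard disjoint LinUCB formulation) shows a per-arm ridge-regression model where setting the exploration coefficient `alpha` to 0 recovers the pure-greedy baseline the study compares against:

```python
import numpy as np

class LinearBandit:
    """Contextual linear bandit with a shared ridge-regression backbone.

    alpha = 0.0 gives the pure-greedy (no-exploration) model;
    alpha > 0.0 adds a LinUCB-style confidence bonus.
    """

    def __init__(self, n_arms, dim, alpha=0.0, lam=1.0):
        self.alpha = alpha
        self.A = [lam * np.eye(dim) for _ in range(n_arms)]  # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]      # per-arm reward-weighted contexts

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                              # ridge estimate of arm weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)    # UCB confidence width
            scores.append(x @ theta + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Standard rank-1 update of the chosen arm's sufficient statistics
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The paper's point is that when such variants are tuned offline, the optimizer tends to drive `alpha` toward 0, i.e. toward this greedy special case.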
Problem

Research questions and friction points this paper is trying to address.

Evaluating exploration bias in linear bandit recommender offline evaluation
Comparing greedy vs exploratory linear MABs in offline settings
Identifying limitations of offline evaluation for bandit exploration efficacy
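The offline protocols under scrutiny are typically replay-style evaluators. A minimal sketch of that idea (an assumption about the setup, in the style of Li et al.'s replay method, not code from this paper): the policy is scored on logged interactions, and only rounds where its choice matches the logged action count toward the estimate:

```python
def replay_evaluate(policy, log):
    """Replay-style offline evaluation of a bandit policy.

    `log` is an iterable of (context, logged_arm, reward) tuples from a
    previously deployed policy. Only rounds where `policy` picks the same
    arm as the log are counted, which is exactly the myopic, static reward
    estimation the paper argues cannot credit exploration's long-term value.
    """
    total_reward, matched = 0.0, 0
    for x, logged_arm, reward in log:
        if policy.select(x) == logged_arm:
            policy.update(logged_arm, x, reward)  # learn only from matched rounds
            total_reward += reward
            matched += 1
    return total_reward / max(matched, 1)  # average reward over matched rounds
```

Under this protocol an exploratory choice that mismatches the log is simply discarded, so a policy that always exploits the logged favorites is structurally advantaged.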
Innovation

Methods, ideas, or system contributions that make the work stand out.

Greedy linear model outperforms exploratory bandits
Hyperparameter optimization minimizes exploration strategies
Offline evaluation inadequately assesses exploration efficacy