Online Learning with Recency: Algorithms for Sliding-window Streaming Multi-armed Bandits

📅 2026-06-07

🤖 AI Summary

This work studies stochastic multi-armed bandits with sub-Gaussian rewards in a sliding-window streaming model: arms arrive in a single pass, only the most recent $W$ arms are valid, and memory, measured as the number of stored arms, is limited. It analyzes both pure exploration and regret minimization in this setting. For pure exploration, it proves that finding the best arm is hard with sublinear memory, whereas finding an approximate best arm admits an efficient algorithm. For regret minimization, it introduces a new notion of regret suited to the sliding window and gives sharp memory-regret trade-offs that hold for any single-pass algorithm. Experiments complement the theory by demonstrating the trade-offs among sample complexity, regret, and memory.

📝 Abstract

Motivated by the recency effect in online learning, we study algorithms for single-pass *sliding-window streaming multi-armed bandits (MABs)*. In this setting, we are given $n$ arms with unknown sub-Gaussian reward distributions and a window parameter $W$. The arms arrive in a single-pass stream, and only the most recent $W$ arms are considered valid. The algorithm is required to perform pure exploration and regret minimization with limited memory, defined as the number of stored arms. The model is a natural extension of the streaming multi-armed bandits model (without the sliding window), which has been extensively studied in recent years. We provide a comprehensive analysis of both the pure exploration and regret minimization problems in this model. For pure exploration, we prove that finding the best arm is hard with sublinear memory, while finding an approximate best arm admits an efficient algorithm. For regret minimization, we explore a new notion of regret and give sharp memory-regret trade-offs for any single-pass algorithm. We complement our theoretical results with experiments demonstrating the trade-offs among sample complexity, regret, and memory.
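To make the model concrete, the following is a minimal simulation sketch of the setting described in the abstract. It is not the paper's algorithm: `run_stream`, its parameters, and the naive "keep the most recent `memory` arms" rule are hypothetical choices made purely for illustration, and unit-variance Gaussian rewards stand in for a generic sub-Gaussian distribution. Each arriving arm is pulled a fixed number of times, and after every arrival the sketch reports the stored arm with the highest empirical mean among those still inside the window of the last $W$ arms.

```python
import random
from collections import deque


def run_stream(means, window, memory, samples_per_arm=200, seed=0):
    """Single pass over a stream of arms (illustrative baseline only).

    means           -- true mean reward of each arm, in arrival order
    window          -- W: only the most recent W arms are valid
    memory          -- maximum number of arms kept in storage (must be >= 1)
    samples_per_arm -- pulls spent on each arm when it arrives (an assumption)

    Returns one (reported_arm, true_best_valid_arm) pair per arrival.
    """
    rng = random.Random(seed)
    # Memory is measured as the number of stored arms; the deque evicts
    # the oldest stored arm automatically once `memory` arms are held.
    stored = deque(maxlen=memory)
    results = []
    for t, mu in enumerate(means):
        # Pull the arriving arm; N(mu, 1) rewards are sub-Gaussian.
        pulls = [rng.gauss(mu, 1.0) for _ in range(samples_per_arm)]
        stored.append((t, sum(pulls) / samples_per_arm))

        first_valid = max(0, t - window + 1)  # oldest arm still in the window
        valid = [(i, est) for i, est in stored if i >= first_valid]
        reported = max(valid, key=lambda pair: pair[1])[0]
        # Ground truth, for evaluation only (the learner never sees `means`).
        true_best = max(range(first_valid, t + 1), key=lambda i: means[i])
        results.append((reported, true_best))
    return results


if __name__ == "__main__":
    means = [5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for memory in (4, 2):
        print(memory, run_stream(means, window=4, memory=memory))
```

With `memory` equal to `window`, this baseline always holds every valid arm. With `memory=2` and `window=4` on the stream above, arm 0 has already been evicted from storage by step 2 even though it is still valid and is the best arm in the window, so the baseline can only report arm 1 or arm 2. That is the kind of tension between memory and accuracy the abstract refers to, shown here only for this naive rule.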

Problem

Research questions and friction points this paper is trying to address.

sliding-window

multi-armed bandits

online learning

recency

memory constraint

Innovation

Methods, ideas, or system contributions that make the work stand out.

sliding-window streaming

multi-armed bandits

online learning with recency