Off-Policy Evaluation and Learning for the Future under Non-Stationarity

๐Ÿ“… 2025-06-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This paper addresses future off-policy evaluation (F-OPE) and future off-policy learning (F-OPL) in non-stationary environments, i.e., accurately estimating and optimizing a policy's value at future time points (e.g., next month) using only historical data. Existing methods suffer from high bias because they rely on stationarity assumptions or strong parametric reward models. To overcome this, the paper proposes the Off-Policy Estimator for the Future Value (OPFV), the first estimator to explicitly exploit time-series structure (e.g., seasonality and periodicity) through a time-aware importance weighting mechanism, enabling low-bias extrapolation without access to future observations. It further unifies non-stationary dynamics modeling with policy-gradient optimization into a single offline framework that generalizes to unseen future time points. Experiments across diverse non-stationary settings show that the approach substantially outperforms baselines in both future policy value estimation accuracy and optimization performance.

๐Ÿ“ Abstract
We study the novel problem of future off-policy evaluation (F-OPE) and learning (F-OPL) for estimating and optimizing the future value of policies in non-stationary environments, where distributions vary over time. In e-commerce recommendations, for instance, our goal is often to estimate and optimize the policy value for the upcoming month using data collected by an old policy in the previous month. A critical challenge is that data related to the future environment is not observed in the historical data. Existing methods assume stationarity or depend on restrictive reward-modeling assumptions, leading to significant bias. To address these limitations, we propose a novel estimator named Off-Policy Estimator for the Future Value (OPFV), designed for accurately estimating policy values at any future time point. The key feature of OPFV is its ability to leverage the useful structure within time-series data. While future data might not be present in the historical log, we can leverage, for example, seasonal, weekly, or holiday effects that are consistent in both the historical and future data. Our estimator is the first to exploit these time-related structures via a new type of importance weighting, enabling effective F-OPE. Theoretical analysis identifies the conditions under which OPFV becomes low-bias. In addition, we extend our estimator to develop a new policy-gradient method to proactively learn a good future policy using only historical data. Empirical results show that our methods substantially outperform existing methods in estimating and optimizing the future policy value under non-stationarity for various experimental setups.
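The abstract's core idea, importance weighting over time structure shared between the log and the future, can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's exact estimator: the function name, signature, and the simple match-based time weight are hypothetical, and the paper additionally handles samples whose time feature does not match (e.g., via a reward-regression term), which this sketch omits.

```python
import numpy as np

def opfv_style_estimate(rewards, pi_probs, logging_probs,
                        log_times, target_time, time_feature):
    """Hypothetical sketch of a time-feature importance-weighted value estimate.

    rewards       : observed rewards in the historical log
    pi_probs      : probabilities the target policy assigns to the logged actions
    logging_probs : probabilities the logging policy assigned to those actions
    log_times     : timestamps of the logged samples
    target_time   : the future time point whose policy value we want
    time_feature  : map from a timestamp to a coarse feature (e.g. day-of-week)
    """
    phi_target = time_feature(target_time)
    phi_log = np.array([time_feature(t) for t in log_times])

    # Keep only logged samples whose time feature matches the target time,
    # reweighted by the empirical probability of such a match.
    match = (phi_log == phi_target).astype(float)
    p_match = match.mean()
    if p_match == 0.0:
        return 0.0  # no structurally comparable samples in the log

    time_weight = match / p_match             # time-aware importance weight
    policy_weight = pi_probs / logging_probs  # standard IPS weight

    return float(np.mean(time_weight * policy_weight * rewards))
```

With `time_feature` set to day-of-week, for example, a value estimate for next Monday would be built only from logged Mondays, each standard IPS term upweighted by the inverse frequency of Mondays in the log.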
Problem

Research questions and friction points this paper is trying to address.

Estimating future policy values in non-stationary environments
Optimizing policies for future time points without future data
Addressing bias in existing methods by leveraging time-series structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages time-series data structure
Introduces OPFV for future value estimation
Uses novel importance weighting technique
๐Ÿ”Ž Similar Papers
No similar papers found.