🤖 AI Summary
This work addresses the severe bias of existing off-policy evaluation methods under fully deterministic logging policies, a bias that stems from insufficient exploration. To overcome this limitation, the paper proposes the Click-based Inverse Propensity Score (CIPS) estimator, which leverages the inherent randomness in user click behavior to construct importance weights, thereby eliminating the conventional reliance on stochasticity in the logging policy. Theoretical analysis demonstrates that CIPS achieves lower bias and controllable variance. Extensive experiments on both synthetic and real-world datasets confirm that CIPS substantially improves evaluation accuracy over strong baselines while remaining stable.
📝 Abstract
Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is fully deterministic. In this paper, we propose novel Click-based Inverse Propensity Score (CIPS) estimators that exploit the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias than strong baselines across a range of experimental settings with completely deterministic logging policies.
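To make the core idea concrete, here is a minimal sketch contrasting standard IPS weighting with click-probability-based weighting. This is an illustrative assumption, not the paper's exact CIPS estimator: the function names, the array-based interface, and the specific form of the click-based weights are all hypothetical. The point it shows is that standard IPS weights (ratios of policy probabilities) degenerate to 0/1 under a deterministic logging policy, while click probabilities remain strictly positive and can still serve as importance weights.

```python
import numpy as np

def ips_value(rewards, target_probs, logging_probs):
    """Standard IPS estimate: mean of r * pi_target(a|x) / pi_logging(a|x).
    Degenerates when pi_logging is deterministic (probabilities are 0 or 1)."""
    return np.mean(rewards * target_probs / logging_probs)

def click_based_value(clicks, click_prob_target, click_prob_logged):
    """Hypothetical click-based weighting: importance weights built from
    (modeled) click probabilities under the target and logged rankings,
    which stay positive even when the logging policy itself is deterministic.
    This is a sketch of the idea, not the paper's exact CIPS formula."""
    return np.mean(clicks * click_prob_target / click_prob_logged)

# Toy usage: four logged impressions.
rewards = np.ones(4)
probs = np.full(4, 0.5)
print(ips_value(rewards, probs, probs))  # identical policies -> mean reward

clicks = np.array([1.0, 0.0, 1.0, 1.0])
print(click_based_value(clicks, np.full(4, 0.3), np.full(4, 0.6)))
```

The illustrative contrast: if the logging policy is deterministic, `logging_probs` contains only 0s and 1s, so `ips_value` is either undefined or discards all actions the target policy would have taken differently, whereas the click-based weights remain well-defined.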