🤖 AI Summary
Offline policy learning (OPL) in recommendation systems struggles to balance safety and exploration when new items (i.e., new actions) continuously emerge.
Method: We propose Safe Off-Policy Policy Gradient (Safe OPG), a model-free safe OPL method that uses high-confidence off-policy evaluation to enforce a safety constraint, and, building on it, a deployment-efficient framework that leverages the safety margin and gradually relaxes safety regularization across a small number of deployments to encourage exploration of novel actions.
Contribution/Results: In experiments, Safe OPG almost always satisfies the safety requirement even when existing OPL methods violate it greatly, but it tends to be overly conservative toward novel actions. The deployment-efficient framework mitigates this tradeoff, enabling exploration of novel actions while guaranteeing safe deployment of the recommender system.
📝 Abstract
In many real recommender systems, novel items are added frequently over time. The importance of sufficiently presenting novel actions has been widely acknowledged as a driver of long-term user engagement. A recent line of work builds on Off-Policy Learning (OPL), which trains a policy from logged data alone; however, existing methods can be unsafe in the presence of novel actions. Our goal is to develop a framework that enforces exploration of novel actions with a safety guarantee. To this end, we first develop Safe Off-Policy Policy Gradient (Safe OPG), a model-free safe OPL method based on high-confidence off-policy evaluation. In our first experiment, we observe that Safe OPG almost always satisfies a safety requirement, even when existing methods violate it greatly. However, the result also reveals that Safe OPG tends to be too conservative, suggesting a difficult tradeoff between guaranteeing safety and exploring novel actions. To overcome this tradeoff, we also propose a novel framework called Deployment-Efficient Policy Learning for Safe User Exploration, which leverages the safety margin and gradually relaxes safety regularization over a small number of deployments. Our framework thus enables exploration of novel actions while guaranteeing safe implementation of recommender systems.
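To make the core mechanism concrete, here is a minimal sketch of the two ingredients the abstract describes: a high-confidence lower bound on a candidate policy's value (via inverse propensity scoring with a Hoeffding-style deviation term), and a safety check whose margin is relaxed across deployment rounds. All function names, the specific concentration bound, and the linear relaxation schedule are illustrative assumptions, not the paper's actual estimator or schedule.

```python
import numpy as np

def ips_lower_bound(rewards, pi_new, pi_log, delta=0.05):
    """High-confidence lower bound on the candidate policy's value.

    Uses the inverse propensity scoring (IPS) estimate minus a
    Hoeffding-style deviation term that holds with probability
    at least 1 - delta. Assumes rewards lie in [0, 1] and logging
    propensities pi_log are bounded away from zero.
    (Illustrative choice; the paper's bound may differ.)
    """
    w = pi_new / pi_log                 # importance weights
    est = w * rewards                   # per-sample IPS terms, in [0, w.max()]
    n = len(est)
    eps = w.max() * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return est.mean() - eps

def safety_margin(round_k, total_rounds, max_margin=0.05):
    """Progressively relaxed margin: 0 at the first deployment,
    growing linearly to max_margin at the last (assumed schedule)."""
    return max_margin * round_k / max(total_rounds - 1, 1)

def is_safe(lb_new, v_baseline, margin):
    """Deploy only if the lower bound clears the (relaxable) safety line."""
    return lb_new >= v_baseline - margin
```

Under this sketch, early deployments require the candidate's lower bound to match the baseline's value (margin near 0), while later rounds tolerate a small certified shortfall, trading a bounded amount of safety for exposure of novel actions.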