🤖 AI Summary
In two-sided matching markets (e.g., hiring, dating platforms), bidirectional user interactions induce sparse and highly variable rewards, rendering standard offline policy evaluation (OPE) methods ineffective. To address this, we propose a novel doubly robust estimation framework specifically designed for matching settings. Our approach innovatively incorporates intermediate feedback signals—such as user engagement or partial match outcomes—and unifies direct modeling (DM), inverse propensity scoring (IPS), and doubly robust (DR) techniques to construct two estimators: DiPS (Direct + IPS) and DPR (Direct + Propensity-weighted Regression). Both ensure unbiasedness while substantially reducing variance. The method operates solely on naturally logged platform data, enabling efficient offline policy evaluation and learning. Experiments on synthetic benchmarks and real-world A/B test logs from a large-scale recruitment platform demonstrate that our estimators achieve significantly higher evaluation accuracy and superior policy optimization performance compared to state-of-the-art OPE baselines.
📝 Abstract
Matching users based on mutual preferences is a fundamental aspect of services driven by reciprocal recommendations, such as job search and dating applications. Although A/B tests remain the gold standard for evaluating new policies in recommender systems for matching markets, it is costly and impractical for frequent policy updates. Off-Policy Evaluation (OPE) thus plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data naturally collected on the platform. However, unlike conventional recommendation settings, the large scale and bidirectional nature of user interactions in matching platforms introduce variance issues and exacerbate reward sparsity, making standard OPE methods unreliable. To address these challenges and facilitate effective offline evaluation, we propose novel OPE estimators, extit{DiPS} and extit{DPR}, specifically designed for matching markets. Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators while incorporating intermediate labels, such as initial engagement signals, to achieve better bias-variance control in matching markets. Theoretically, we derive the bias and variance of the proposed estimators and demonstrate their advantages over conventional methods. Furthermore, we show that these estimators can be seamlessly extended to offline policy learning methods for improving recommendation policies for making more matches. We empirically evaluate our methods through experiments on both synthetic data and A/B testing logs from a real job-matching platform. The empirical results highlight the superiority of our approach over existing methods in off-policy evaluation and learning tasks for a variety of configurations.