Augmenting Limited and Biased RCTs through Pseudo-Sample Matching-Based Observational Data Fusion Method

📅 2025-09-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In online ride-hailing pricing, randomized controlled trial (RCT) data are scarce (only 0.65% of total traffic) and of reduced quality: they are subject to selection bias and to heterogeneity from potential interference, which biases uplift-model estimates when they are generalized to the broader user base. Existing data fusion approaches are hard to apply in industrial settings because they must handle high-dimensional features and depend on strict assumptions that are difficult to verify with real-world data. This paper proposes pseudo-sample matching, an empirical data fusion method: pseudo-samples are generated from the biased, low-quality RCT data and matched with the most similar samples in large-scale observational data, which expands the RCT dataset while mitigating its heterogeneity. The method is validated through simulation experiments and through offline and online tests on real-world data. A week-long online experiment yielded a 0.41% profit improvement, a considerable gain at a revenue scale of hundreds of millions.

📝 Abstract

In the online ride-hailing pricing context, companies often conduct randomized controlled trials (RCTs) and use uplift models to assess the effect of discounts on customer orders, which substantially influences competitive market outcomes. However, because RCTs are costly, trial data are small relative to observational data, accounting for only 0.65% of total traffic in our context, which results in significant bias when generalizing to the broader user base. Additionally, the complexity of industrial processes reduces the quality of RCT data, which are often subject to heterogeneity from potential interference and selection bias that is difficult to correct. Moreover, existing data fusion methods are challenging to implement effectively in complex industrial settings because of the high dimensionality of features and strict assumptions that are hard to verify with real-world data. To address these issues, we propose an empirical data fusion method called pseudo-sample matching. By generating pseudo-samples from biased, low-quality RCT data and matching them with the most similar samples from large-scale observational data, the method expands the RCT dataset while mitigating its heterogeneity. We validated the method through simulation experiments and through offline and online tests using real-world data. In a week-long online experiment, we achieved a 0.41% improvement in profit, which is a considerable gain when scaled to industrial scenarios with hundreds of millions in revenue. In addition, we discuss the harm to model training, offline evaluation, and online economic benefits when RCT data quality is low, and we emphasize the importance of improving RCT data quality in industrial scenarios. Further details of the simulation experiments can be found in the GitHub repository https://github.com/Kairong-Han/Pseudo-Matching.
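
To make the term "uplift" concrete, here is a minimal sketch of the simplest estimator available on RCT data: the difference in order rates between users who received a discount and users who did not. It is not the uplift model used in the paper, which the abstract does not specify; the function name `average_uplift` and the toy data are illustrative assumptions.

```python
import numpy as np

def average_uplift(treated, ordered):
    """Difference in mean outcome between treated and control units.

    Hypothetical helper for illustration; valid as an effect estimate only
    when treatment was randomly assigned, as in a clean RCT.
    """
    treated = np.asarray(treated, dtype=bool)
    ordered = np.asarray(ordered, dtype=float)
    return ordered[treated].mean() - ordered[~treated].mean()

# Toy data (made up): 4 users got a discount, 4 did not.
treated = [1, 1, 1, 1, 0, 0, 0, 0]
ordered = [1, 1, 1, 0, 1, 0, 0, 0]
print(average_uplift(treated, ordered))  # 0.75 - 0.25 = 0.5
```

When the trial population is small or unrepresentative, this estimate does not carry over to the broader user base, which is the gap the paper targets.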

Problem

Research questions and friction points this paper is trying to address.

Addressing biased RCT data limitations in ride-hailing pricing experiments

Mitigating heterogeneity and selection bias in small-scale trial data

Improving data fusion methods for complex industrial applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pseudo-sample matching expands biased RCT data

Matches RCT samples with observational data counterparts

Mitigates heterogeneity through similarity-based data fusion
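
The matching idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how pseudo-samples are built or which similarity measure is used, so the sketch treats each RCT unit's feature vector and treatment arm as the pseudo-sample and uses Euclidean distance on standardized features. The function name `match_pseudo_samples` and all variable names are hypothetical.

```python
import numpy as np

def match_pseudo_samples(rct_x, rct_t, obs_x, obs_t, obs_y):
    """For each RCT unit, find the closest observational unit in the same arm.

    Assumes every treatment value in rct_t also occurs in obs_t.
    Returns the matched observational indices and the matched
    (features, treatment, outcome) arrays that could augment the RCT set.
    """
    # Standardize with observational statistics so that no single
    # feature dominates the distance (an assumed design choice).
    mu = obs_x.mean(axis=0)
    sigma = obs_x.std(axis=0) + 1e-8
    rct_z = (rct_x - mu) / sigma
    obs_z = (obs_x - mu) / sigma

    matched = np.empty(len(rct_x), dtype=int)
    for i, (z, t) in enumerate(zip(rct_z, rct_t)):
        candidates = np.flatnonzero(obs_t == t)  # same treatment arm only
        dists = np.linalg.norm(obs_z[candidates] - z, axis=1)
        matched[i] = candidates[np.argmin(dists)]
    return matched, (obs_x[matched], obs_t[matched], obs_y[matched])

# Toy data (made up): four observational units, two RCT units.
obs_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
obs_t = np.array([0, 1, 0, 1])
obs_y = np.array([0.0, 1.0, 0.0, 1.0])
rct_x = np.array([[0.1, 0.1], [5.9, 5.9]])
rct_t = np.array([1, 0])

idx, (aug_x, aug_t, aug_y) = match_pseudo_samples(rct_x, rct_t, obs_x, obs_t, obs_y)
print(idx.tolist())  # [1, 2]
```

The brute-force loop keeps the sketch dependency-free; at industrial scale an approximate nearest-neighbor index would be the more realistic choice.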

🔎 Similar Papers

A Double Machine Learning Approach to Combining Experimental and Observational Data