AI Summary
This work addresses offsite conversion rate (OCVR) prediction, where conversion signals are sparse, delayed, and difficult to attribute, in contrast to dense click signals. The authors propose a heterogeneous dual-stream pre-training architecture with dedicated Transformer encoders tailored to the distinct statistical characteristics of click and conversion sequences: the click stream uses multi-layer self-attention, while the conversion stream interleaves cross-attention and self-attention. The embeddings from both streams are then consumed jointly by a downstream ranking model, so both signal types inform prediction while staying within strict online serving-latency budgets. Experiments show up to a 0.38% reduction in offline normalized entropy (NE) relative to the strongest baseline, and an A/B test shows consistent improvements in OCVR prediction accuracy.
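For readers unfamiliar with the metric, the snippet below is a minimal sketch of how normalized entropy is commonly computed: the model's average log loss divided by the entropy of the empirical base rate, so a value of 1.0 corresponds to always predicting the base rate and lower is better. The text here does not define NE or say whether the 0.38% figure is relative or absolute; this formula and the relative-reduction helper are assumptions, and the function names are hypothetical.

```python
import math


def normalized_entropy(labels, predictions, eps=1e-12):
    """Average log loss divided by the entropy of the empirical base rate.

    Assumed definition (common in ads ranking); not taken from the paper.
    """
    n = len(labels)
    base_rate = sum(labels) / n
    if not 0.0 < base_rate < 1.0:
        raise ValueError("labels must contain both positives and negatives")
    # Clamp predictions away from 0 and 1 so the logarithms stay finite.
    log_loss = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for y, p in zip(labels, predictions)
    ) / n
    base_entropy = -(
        base_rate * math.log(base_rate)
        + (1.0 - base_rate) * math.log(1.0 - base_rate)
    )
    return log_loss / base_entropy


def relative_ne_reduction(baseline_ne, candidate_ne):
    """Percentage by which the candidate lowers NE relative to the baseline."""
    return 100.0 * (baseline_ne - candidate_ne) / baseline_ne


# A model that always predicts the base rate has NE of 1 (up to rounding).
labels = [1, 0, 0, 0]
ne_base_rate_model = normalized_entropy(labels, [0.25] * 4)
```

Under the relative reading, a drop from a baseline NE of 0.8 to 0.79696 would be reported as a 0.38% reduction; these two values are illustrative only, not results from the paper.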
Abstract
Offsite conversion rate (OCVR) prediction is an important ranking problem in computational recommendation systems. The task poses a modeling challenge: click signals are abundant and have short temporal horizons, whereas conversion signals are inherently sparse, long-delayed, and frequently unattributed. Despite these statistical disparities, both signal types must inform models that operate within strict serving-latency constraints. Prior pre-training approaches handle this heterogeneity with a single, undifferentiated encoder applied uniformly to both data streams. We propose DUET (Dual User Embedding Transformers), a framework that explicitly partitions user behavioral data into two domain-coherent streams -- clicks and conversions -- and pre-trains dedicated transformer encoders with architectures tailored to each stream's statistical characteristics: multi-layer self-attention for the dense click stream and interleaved cross- and self-attention for the sparse conversion stream. The resulting complementary embeddings are jointly consumed by a downstream ranker without exceeding serving-latency budgets. Evaluation shows up to a 0.38% reduction in normalized entropy (NE) relative to the strongest baseline, and an A/B test shows consistent improvements in OCVR prediction accuracy.
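The dual-stream layout described above can be illustrated with a small, dependency-free sketch. It is not the DUET implementation: the attention here is single-head with no learned projections, normalization, or feed-forward layers, and every function name is hypothetical. The abstract does not say what the conversion stream's cross-attention attends over; this sketch assumes a small set of query tokens that read from the conversion events, and it assumes fusion by concatenation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention with a residual connection.

    Assumes at least one key/value token and equal vector dimensions.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        context = [
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(dim)
        ]
        outputs.append([qi + ci for qi, ci in zip(q, context)])
    return outputs


def mean_pool(tokens):
    """Average token vectors into one embedding."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def encode_clicks(click_events, num_layers=2):
    """Dense stream: stacked self-attention over the click sequence."""
    hidden = click_events
    for _ in range(num_layers):
        hidden = attend(hidden, hidden)
    return mean_pool(hidden)


def encode_conversions(conversion_events, query_tokens, num_blocks=2):
    """Sparse stream: alternate cross-attention and self-attention.

    Query tokens reading from conversion events is an assumption.
    """
    hidden = query_tokens
    for _ in range(num_blocks):
        hidden = attend(hidden, conversion_events)  # cross-attention
        hidden = attend(hidden, hidden)             # self-attention
    return mean_pool(hidden)


def fuse(click_embedding, conversion_embedding):
    """Concatenate both embeddings for a downstream ranker (assumed fusion)."""
    return click_embedding + conversion_embedding


# Toy data: many click events, few conversion events, dimension 4.
rng = random.Random(0)
clicks = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]
conversions = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
queries = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]
user_embedding = fuse(
    encode_clicks(clicks), encode_conversions(conversions, queries)
)
```

One reason such a split can help with latency is that the conversion encoder's cost scales with a fixed, small number of query tokens rather than with the full event history, though whether DUET relies on this is not stated in the abstract.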