🤖 AI Summary
Industrial recommendation systems suffer severe overfitting and performance degradation on discriminative tasks such as CTR/CVR prediction, a problem caused by data sparsity that worsens as models grow. To address this, the paper proposes GPSD, a framework that adapts generative pretraining (via autoregressive modeling) to discriminative recommendation tasks to improve generalization. GPSD initializes the discriminative model from a pretrained generative model and introduces a sparse parameter freezing strategy that keeps a subset of the transferred parameters frozen during fine-tuning, balancing model capacity and training stability. Empirically, discriminative performance scales with model size following a power law, and GPSD unifies the architectural paradigms of recommendation and language modeling. Scaling dense parameters up to 0.3B, GPSD substantially narrows the train-test generalization gap and outperforms state-of-the-art methods on multiple industrial and public benchmarks; online A/B tests confirm statistically significant CTR improvements.
📝 Abstract
Discriminative recommendation tasks, such as CTR (click-through rate) and CVR (conversion rate) prediction, play critical roles in the ranking stage of large-scale industrial recommender systems. However, training a discriminative model encounters a significant overfitting issue induced by data sparsity. Moreover, this overfitting issue worsens with larger models, causing them to underperform smaller ones. To address the overfitting issue and enhance model scalability, we propose a framework named GPSD (**G**enerative **P**retraining for **S**calable **D**iscriminative Recommendation), drawing inspiration from generative training, which exhibits no evident signs of overfitting. GPSD leverages the parameters learned from a pretrained generative model to initialize a discriminative model, and subsequently applies a sparse parameter freezing strategy. Extensive experiments conducted on both industrial-scale and publicly available datasets demonstrate the superior performance of GPSD. Moreover, it delivers remarkable improvements in online A/B tests. GPSD offers two primary advantages: 1) it substantially narrows the generalization gap in model training, resulting in better test performance; and 2) it leverages the scalability of Transformers, delivering consistent performance gains as models are scaled up. Specifically, we observe consistent performance improvements as the dense model parameters scale from 13K to 0.3B, closely adhering to power laws. These findings pave the way for unifying the architectures of recommendation models and language models, enabling the direct application of techniques well-established in large language models to recommendation models.
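The transfer-and-freeze procedure the abstract describes can be sketched in a few lines: pretrained generative parameters initialize the discriminative model, and a chosen subset stays frozen during fine-tuning. This is a minimal illustration under assumed names, not the paper's implementation; the parameter keys, the freezing criterion, and the plain-dict update are all hypothetical stand-ins for a real Transformer training loop.

```python
# Hypothetical sketch of GPSD's transfer step: copy pretrained generative
# parameters into the discriminative model, then freeze a chosen subset
# during fine-tuning. Parameter names and values are illustrative only.

def init_discriminative(generative_params, freeze_keys):
    """Copy all pretrained weights and mark which ones stay frozen."""
    params = dict(generative_params)
    frozen = {k: (k in freeze_keys) for k in params}
    return params, frozen

def fine_tune_step(params, frozen, grads, lr=0.01):
    """Apply a gradient step only to trainable (non-frozen) parameters."""
    for k, g in grads.items():
        if not frozen[k]:
            params[k] = params[k] - lr * g
    return params

# Toy usage with two scalar "parameters"; the embedding stays frozen.
pretrained = {"item_embedding": 1.0, "prediction_head": 0.5}
params, frozen = init_discriminative(pretrained, freeze_keys={"item_embedding"})
params = fine_tune_step(params, frozen,
                        grads={"item_embedding": 0.2, "prediction_head": 0.2})
# item_embedding is unchanged (1.0); prediction_head moves to 0.498
```

In a real system the per-parameter freeze mask would be set over sparse embedding tables versus dense Transformer weights, which is where the capacity/stability trade-off the abstract mentions comes in.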