Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI-Generated Bidding (AIGB) methods suffer from coarse-grained quality assessment and static dataset constraints, hindering policy generalization and stable optimization. To address these limitations, we propose AIGB-Pearl: a novel framework that models bidding as a trajectory generation task. It introduces a non-bootstrapped trajectory evaluator, integrating LLM-based representations, hybrid pointwise/pairwise losses, and expert feedback for fine-grained generation quality assessment. Furthermore, AIGB-Pearl synergizes diffusion-model-driven generative planning with offline reinforcement learning to enable efficient and robust policy exploration. Evaluated on both synthetic and real-world advertising systems, AIGB-Pearl achieves significant improvements in bid stability and campaign performance, attaining state-of-the-art results. Extensive experiments validate its strong generalization capability across diverse scenarios and practical efficacy in production environments.

📝 Abstract
Auto-bidding is an essential tool for advertisers to enhance their advertising performance. Recent progress has shown that AI-Generated Bidding (AIGB), which formulates auto-bidding as a trajectory generation task and trains a conditional diffusion-based planner on offline data, achieves superior and stable performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still encounter a performance bottleneck due to their neglect of fine-grained generation quality evaluation and inability to explore beyond static datasets. To address this, we propose AIGB-Pearl (Planning with EvAluator via RL), a novel method that integrates generative planning and policy optimization. The key to AIGB-Pearl is to construct a non-bootstrapped trajectory evaluator to assign rewards and guide policy search, enabling the planner to optimize its generation quality iteratively through interaction. Furthermore, to enhance trajectory evaluator accuracy in offline settings, we incorporate three key techniques: (i) a Large Language Model (LLM)-based architecture for better representational capacity, (ii) hybrid point-wise and pair-wise losses for better score learning, and (iii) adaptive integration of expert feedback for better generalization ability. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
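The evaluator-guided policy search described above can be illustrated with a minimal sketch: a frozen scorer assigns rewards to generated bid trajectories, and the planner's parameters are refined to increase that score. The quadratic evaluator, the target profile, and the finite-difference update below are all illustrative stand-ins, not the paper's actual LLM-based evaluator or training procedure.

```python
import numpy as np

# Frozen trajectory evaluator (stand-in for the learned, non-bootstrapped
# scorer): here a fixed quadratic that prefers bids near a target profile.
target = np.array([0.2, 0.5, 0.8])

def evaluate(traj):
    """Score a candidate bid trajectory; higher is better."""
    return -np.sum((traj - target) ** 2)

# Planner parameters: a mean bid trajectory, iteratively improved using the
# evaluator's score as reward (finite-difference gradient ascent stands in
# for the RL-based policy search).
theta = np.zeros(3)
lr, eps = 0.1, 1e-4
for _ in range(200):
    grad = np.array([
        (evaluate(theta + eps * e) - evaluate(theta - eps * e)) / (2 * eps)
        for e in np.eye(3)
    ])
    theta += lr * grad
```

Because the evaluator is non-bootstrapped (it never regresses onto its own predictions), the reward signal stays fixed during policy search, which is the stability property the abstract emphasizes.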
Problem

Research questions and friction points this paper is trying to address.

Improving auto-bidding performance through offline reward evaluation
Addressing the lack of fine-grained generation quality assessment in AIGB
Enhancing trajectory evaluator accuracy with LLM architecture and hybrid losses
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based trajectory evaluator for reward assignment
Hybrid point-wise and pair-wise loss functions
Offline policy optimization with expert feedback integration
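The hybrid point-wise and pair-wise score learning listed above can be sketched as follows. The specific loss forms (MSE for the point-wise term, a logistic ranking loss for the pair-wise term) and the `alpha` weighting are assumptions for illustration; the paper's exact objective may differ.

```python
import numpy as np

def hybrid_evaluator_loss(scores, targets, alpha=0.5):
    """Hybrid point-wise + pair-wise loss for a trajectory evaluator.

    scores:  predicted quality scores for a batch of trajectories
    targets: reward labels for the same trajectories
    alpha:   illustrative weight between the two terms
    """
    # Point-wise term: regress predicted scores onto reward labels.
    pointwise = np.mean((scores - targets) ** 2)

    # Pair-wise term: logistic ranking loss over all ordered pairs,
    # pushing the evaluator to rank higher-reward trajectories above
    # lower-reward ones.
    diff_s = scores[:, None] - scores[None, :]
    diff_t = targets[:, None] - targets[None, :]
    mask = diff_t > 0  # pairs (i, j) where trajectory i truly outranks j
    pairwise = np.mean(np.log1p(np.exp(-diff_s[mask]))) if mask.any() else 0.0

    return alpha * pointwise + (1 - alpha) * pairwise
```

The point-wise term anchors scores to absolute reward values, while the pair-wise term preserves relative ordering even when absolute labels are noisy, which is the usual motivation for combining the two.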
Zhiyu Mou
M.S. student at Tsinghua University
machine learning, network intelligence, reinforcement learning, graph neural network
Yiqin Lv
Alibaba Group, Beijing, China; Department of Automation, Tsinghua University, Beijing, China
Miao Xu
Alibaba Group, Beijing, China
Cheems Wang
Tsinghua University, Universiteit van Amsterdam, Sichuan University
Large Models, Meta Learning, Multi-Task Learning, Reinforcement Learning, Generative Modeling
Yixiu Mao
Department of Automation, Tsinghua University, Beijing, China
Qichen Ye
Peking University
Natural Language Processing, Recommendation System
Chao Li
Alibaba Group, Beijing, China
Rongquan Bai
Alibaba Group, Beijing, China
Chuan Yu
Alibaba Group, Beijing, China
Jian Xu
Alibaba Group, Beijing, China
Bo Zheng
Alibaba Group, Beijing, China