🤖 AI Summary
This work addresses the “zero collapse” phenomenon in reinforcement learning under discontinuous rewards, as in repeated first-price auctions. In these settings, stochastic exploration and gradient updates can push a policy past the optimal region into a flat zero-reward plateau, where gradients vanish and recovery is highly sample-inefficient. The paper introduces and analyzes zero collapse, characterizes how policy stochasticity and learning rate (step size) affect training stability, and presents a formal reinforcement learning framework tailored to auction settings. It demonstrates the phenomenon empirically across REINFORCE and several actor-critic variants, noting that biased value estimates make actor-critic methods particularly susceptible, and proposes mitigation strategies based on careful initialization and network architecture design to improve training stability.
📝 Abstract
Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited to these settings, they often struggle with the discontinuous, "cliff-like" nature of auction reward landscapes. In a first-price auction, for example, a bidder receives zero reward until their bid crosses the threshold needed to win, after which the reward decreases as the bid increases. This creates a landscape of flat, zero-reward regions separated by sharp boundaries.
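The cliff-like shape described above can be made concrete with a minimal sketch of a single-round first-price reward. The function name, the `value` and `highest_competing_bid` parameters, and the rule that ties go to the bidder are illustrative assumptions, not definitions taken from the paper.

```python
def first_price_reward(bid, value, highest_competing_bid):
    """Reward for one round of a first-price auction (illustrative sketch).

    Assumptions (not from the paper): the bidder wins whenever its bid is at
    least the highest competing bid, and a winner pays its own bid.
    """
    if bid >= highest_competing_bid:
        # Winning region: the payoff shrinks linearly as the bid grows.
        return value - bid
    # Losing region: flat zero reward, regardless of how close the bid was.
    return 0.0


# Sweep a few bids to see the flat region, the jump, and the downward slope.
for bid in (0.2, 0.4, 0.5, 0.6, 0.9):
    print(bid, first_price_reward(bid, value=1.0, highest_competing_bid=0.5))
```

Under these assumptions the reward is highest exactly at the threshold and drops to zero immediately below it, which is the discontinuity the abstract refers to.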
We identify a fundamental failure mode in this setting, which we term "zero collapse." We show that stochastic exploration and gradient-based updates can cause policies to overshoot optimal high-reward regions and enter flat, zero-reward regimes. Once there, the lack of an informative gradient signal makes recovery extremely sample-inefficient, effectively trapping the agent. We find that actor-critic methods are particularly susceptible, as biased value estimates can accelerate this movement toward unstable regions.
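The loss of gradient signal can be illustrated with a small, hedged sketch: a score-function (REINFORCE-style) estimate of the gradient with respect to the mean of a Gaussian bidding policy, evaluated once inside the winning region and once inside the zero-reward region. The fixed threshold, the Gaussian policy, the absence of a baseline, and all numeric values are assumptions chosen for illustration; this is not the paper's experimental setup.

```python
import random


def first_price_reward(bid, value, threshold):
    # Assumed toy reward: win and pay the bid if bid >= threshold, else zero.
    return value - bid if bid >= threshold else 0.0


def reinforce_mean_gradient(mu, sigma, value, threshold, n_samples=20000, seed=0):
    """Monte Carlo score-function estimate of d/d(mu) E[reward].

    For a Gaussian policy N(mu, sigma^2), the score with respect to mu is
    (bid - mu) / sigma^2, so each sample contributes reward * score.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        bid = rng.gauss(mu, sigma)
        reward = first_price_reward(bid, value, threshold)
        total += reward * (bid - mu) / sigma ** 2
    return total / n_samples


# Mean bid above the threshold: the estimate is negative, pushing mu downward
# toward the threshold (and, with a large step, potentially past it).
grad_inside = reinforce_mean_gradient(mu=0.7, sigma=0.04, value=1.0, threshold=0.5)

# Mean bid far below the threshold: every sampled bid loses, every reward is
# zero, and the estimate carries no information about where the cliff is.
grad_collapsed = reinforce_mean_gradient(mu=0.1, sigma=0.04, value=1.0, threshold=0.5)

print("gradient inside winning region:", grad_inside)
print("gradient in zero-reward region:", grad_collapsed)
```

In this toy setting the downward gradient in the winning region combined with a large step size or wide exploration noise is what can carry the mean across the threshold, after which the estimate is exactly zero and the policy has no signal to move back.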
Our contributions include: (1) a mechanistic explanation of how discontinuous rewards lead to vanishing signals and zero collapse; (2) an analysis of the interaction between policy stochasticity and step size; and (3) an empirical demonstration of this phenomenon across REINFORCE and actor-critic variants. We also propose practical mitigation strategies, based on initialization and architectural choices, to improve stability. Finally, we introduce a formal RL framework for auction environments that highlights their unique structural properties.