🤖 AI Summary
Reinforcement learning (RL) suffers from poor exploration under sparse or adversarial reward settings, yet the role of policy initialization in shaping early exploration remains poorly understood.
Method: Leveraging infinite-width neural network theory and a continuous-time limit, we analyze the trajectories induced by untrained policies through a ballistic–diffusive decomposition, characterizing how deep neural policy architectures implicitly bias exploration before any training occurs.
Contribution/Results: We establish, for the first time, that randomly initialized deep policies, without any gradient updates, can generate structured action sequences and non-trivial state-visitation distributions, revealing an intrinsic, architecture-dependent exploration mechanism rooted in initialization. This elevates policy initialization to a designable exploration primitive: we systematically quantify how the inductive biases of mainstream architectures (e.g., MLPs, CNNs) govern early exploration dynamics. Our work provides a unified theoretical and empirical framework linking network structure, initialization, and exploratory capacity, enabling principled, initialization-aware exploration design in deep RL.
📝 Abstract
Exploration remains a fundamental challenge in reinforcement learning (RL), particularly in environments with sparse or adversarial reward structures. In this work, we study how the architecture of deep neural policies implicitly shapes exploration before training. We theoretically and empirically demonstrate strategies for generating ballistic or diffusive trajectories from untrained policies in a toy model. Using the theory of infinite-width networks and a continuous-time limit, we show that untrained policies produce correlated actions and induce non-trivial state-visitation distributions. We characterize the distributions of the corresponding trajectories for a standard architecture, revealing inductive biases relevant to exploration. Our results establish a theoretical and experimental framework for using policy initialization as a design tool to understand exploration behavior in early training.
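The core intuition can be illustrated with a minimal sketch (our own toy construction, not the paper's exact model or architecture): an untrained MLP is a smooth function of the state, so along a rollout its successive actions are strongly correlated and the state drifts near-linearly in time (ballistic), whereas i.i.d. random actions yield random-walk, square-root-in-time growth (diffusive).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 1-D state, deterministic actions from an
# untrained 2-layer tanh MLP with standard Gaussian initialization.
H = 256                                            # hidden width
W1 = rng.normal(0.0, 1.0, size=H)                  # input -> hidden weights
b1 = rng.normal(0.0, 1.0, size=H)                  # hidden biases
W2 = rng.normal(0.0, 1.0 / np.sqrt(H), size=H)     # hidden -> output, 1/sqrt(width) scaling

def policy(x):
    """Action of the untrained MLP at scalar state x."""
    return float(W2 @ np.tanh(W1 * x + b1))

T, dt = 500, 0.01
x_mlp = np.zeros(T + 1)   # rollout driven by the untrained MLP
x_iid = np.zeros(T + 1)   # rollout driven by i.i.d. Gaussian actions
for t in range(T):
    x_mlp[t + 1] = x_mlp[t] + dt * policy(x_mlp[t])
    x_iid[t + 1] = x_iid[t] + dt * rng.normal()

# The MLP rollout's actions vary smoothly with the slowly moving state,
# so they stay correlated step to step (ballistic drift); the i.i.d.
# rollout's displacement instead grows diffusively, like sqrt(t).
print("MLP displacement:", abs(x_mlp[-1]), " i.i.d. displacement:", abs(x_iid[-1]))
```

This is only a caricature of the ballistic–diffusive distinction the abstract refers to; the paper's analysis works in the infinite-width and continuous-time limits rather than with a fixed finite network.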