🤖 AI Summary
Reinforcement learning (RL) suffers from poor exploration under sparse or adversarial reward settings, yet the role of policy initialization in shaping early exploration remains poorly understood.
Method: Leveraging infinite-width neural network theory and a continuous-time limit, we analyze the trajectories induced by untrained policies through a ballistic–diffusive decomposition, characterizing how deep neural policy architectures implicitly bias exploration before any training occurs.
Contribution/Results: We establish, for the first time, that randomly initialized deep policies, without any gradient updates, can generate structured action sequences and non-trivial state-visitation distributions, revealing an intrinsic, architecture-dependent exploration mechanism rooted in initialization. This elevates policy initialization to a designable exploration primitive: we systematically quantify how the inductive biases of mainstream architectures (e.g., MLPs, CNNs) govern early exploration dynamics. Our work provides a unified theoretical and empirical framework linking network structure, initialization, and exploratory capacity, enabling principled, initialization-aware exploration design in deep RL.
📝 Abstract
Exploration remains a fundamental challenge in reinforcement learning (RL), particularly in environments with sparse or adversarial reward structures. In this work, we study how the architecture of deep neural policies implicitly shapes exploration before training. We theoretically and empirically demonstrate strategies for generating ballistic or diffusive trajectories from untrained policies in a toy model. Using the theory of infinite-width networks and a continuous-time limit, we show that untrained policies produce correlated actions and induce non-trivial state-visitation distributions. We characterize the distributions of the corresponding trajectories for a standard architecture, revealing inductive biases relevant to exploration. Our results establish a theoretical and experimental framework for using policy initialization as a design tool to understand exploration behavior in early training.
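The core intuition can be illustrated with a minimal sketch (our own toy construction, not the paper's exact model or architecture): an untrained MLP is a smooth function of the state, so along a rollout its successive actions are strongly correlated and the state drifts near-linearly in time (ballistic), whereas i.i.d. random actions yield random-walk, square-root-in-time growth (diffusive).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 1-D state, deterministic actions from an
# untrained 2-layer tanh MLP with standard Gaussian initialization.
H = 256                                            # hidden width
W1 = rng.normal(0.0, 1.0, size=H)                  # input -> hidden weights
b1 = rng.normal(0.0, 1.0, size=H)                  # hidden biases
W2 = rng.normal(0.0, 1.0 / np.sqrt(H), size=H)     # hidden -> output, 1/sqrt(width) scaling

def policy(x):
    """Action of the untrained MLP at scalar state x."""
    return float(W2 @ np.tanh(W1 * x + b1))

T, dt = 500, 0.01
x_mlp = np.zeros(T + 1)   # rollout driven by the untrained MLP
x_iid = np.zeros(T + 1)   # rollout driven by i.i.d. Gaussian actions
for t in range(T):
    x_mlp[t + 1] = x_mlp[t] + dt * policy(x_mlp[t])
    x_iid[t + 1] = x_iid[t] + dt * rng.normal()

# The MLP rollout's actions vary smoothly with the slowly moving state,
# so they stay correlated step to step (ballistic drift); the i.i.d.
# rollout's displacement instead grows diffusively, like sqrt(t).
print("MLP displacement:", abs(x_mlp[-1]), " i.i.d. displacement:", abs(x_iid[-1]))
```

This is only a caricature of the ballistic–diffusive distinction the abstract refers to; the paper's analysis works in the infinite-width and continuous-time limits rather than with a fixed finite network.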