From Parameters to Behavior: Unsupervised Compression of the Policy Space

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep reinforcement learning (DRL) directly optimizes policies in a high-dimensional, redundant parameter space Θ, resulting in poor sample efficiency and weak generalization across tasks. To address this, we propose a policy manifold compression framework: a generative model is trained via a behavioral reconstruction loss to map policies into a low-dimensional latent space, where clustering reflects behavioral similarity—not parameter proximity—and the latent dimensionality is conjectured to track the intrinsic complexity of the environment rather than the size of the policy network. We further design a latent-space policy gradient algorithm to enable efficient optimization within this compressed representation. Experiments on continuous control benchmarks demonstrate up to five orders-of-magnitude parameter compression while preserving over 90% of policy expressivity. Moreover, the method significantly improves cross-task adaptability and sample efficiency, enabling rapid transfer and fine-tuning with minimal environmental interaction.
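The compression idea above can be illustrated in miniature. The sketch below is a toy, not the paper's method: it assumes linear policies $a = Ws$, a linear decoder fitted by PCA, and a diagonal state covariance, so the behavioral reconstruction objective $\mathbb{E}_s\|(W - \hat W)s\|^2$ has a closed form. All names (`basis`, `behavioral_sqdist`, the dimensions) are hypothetical; the key point it demonstrates is that fitting the compressor under a *behaviorally weighted* norm differs from plain parameter-space PCA whenever the state distribution is anisotropic.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, act_dim, latent_dim, n_policies = 8, 2, 3, 200

# Synthetic "policy population": each policy's weight matrix W is a
# combination of latent_dim basis behaviors plus small parameter noise
# (a stand-in for the redundancy of the raw parameter space Theta).
basis = rng.normal(size=(latent_dim, act_dim * state_dim))
codes = rng.normal(size=(n_policies, latent_dim))
thetas = codes @ basis + 0.01 * rng.normal(size=(n_policies, act_dim * state_dim))

# Anisotropic state distribution s ~ N(0, Sigma): the behavioral distance
# E_s ||(W - W')s||^2 weights parameter directions by Sigma, so behavioral
# similarity is not the same as parameter proximity.
scales = np.linspace(2.0, 0.1, state_dim)
Sigma_half = np.diag(scales)  # Sigma = Sigma_half @ Sigma_half

def behavioral_sqdist(theta_a, theta_b):
    """E_s ||(W_a - W_b) s||^2 for linear policies a = W s."""
    d = (theta_a - theta_b).reshape(act_dim, state_dim)
    return float(np.sum((d @ Sigma_half) ** 2))

# Linear "generative model" g: Z -> Theta, fitted in the behaviorally
# weighted space: whiten each W by Sigma^{1/2}, then PCA to latent_dim.
whitened = (thetas.reshape(-1, act_dim, state_dim) @ Sigma_half).reshape(n_policies, -1)
mean = whitened.mean(axis=0)
_, _, Vt = np.linalg.svd(whitened - mean, full_matrices=False)
Z = (whitened - mean) @ Vt[:latent_dim].T        # latent codes
recon_w = mean + Z @ Vt[:latent_dim]             # decode back to whitened params
recon = (recon_w.reshape(-1, act_dim, state_dim)
         @ np.diag(1.0 / scales)).reshape(n_policies, -1)

orig_norm = np.mean([behavioral_sqdist(t, np.zeros_like(t)) for t in thetas])
recon_err = np.mean([behavioral_sqdist(t, r) for t, r in zip(thetas, recon)])
retained = 1 - recon_err / orig_norm
print(f"behavioral norm retained at {latent_dim} latent dims: {retained:.3f}")
```

Because Euclidean distance in the whitened space equals the behavioral distance, PCA there directly minimizes behavioral reconstruction error; with the 16 raw parameters compressed to 3 latents, nearly all behavioral variation survives.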

📝 Abstract
Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient. We argue that this inefficiency stems from the standard practice of optimizing policies directly in the high-dimensional and highly redundant parameter space $\Theta$. This challenge is greatly compounded in multi-task settings. In this work, we develop a novel, unsupervised approach that compresses the policy parameter space $\Theta$ into a low-dimensional latent space $\mathcal{Z}$. We train a generative model $g:\mathcal{Z}\to\Theta$ by optimizing a behavioral reconstruction loss, which ensures that the latent space is organized by functional similarity rather than proximity in parameterization. We conjecture that the inherent dimensionality of this manifold is a function of the environment's complexity, rather than the size of the policy network. We validate our approach in continuous control domains, showing that the parameterization of standard policy networks can be compressed up to five orders of magnitude while retaining most of its expressivity. As a byproduct, we show that the learned manifold enables task-specific adaptation via Policy Gradient operating in the latent space $\mathcal{Z}$.
Problem

Research questions and friction points this paper is trying to address.

Unsupervised compression of high-dimensional policy parameter space
Addressing sample inefficiency in deep reinforcement learning
Organizing latent space by behavioral similarity not parameter proximity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses policy parameter space into latent space
Trains generative model with behavioral reconstruction loss
Enables policy gradient adaptation in compressed latent space
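The last point can be sketched concretely. The toy below is an assumption-laden stand-in for the paper's latent-space policy gradient: the decoder `g` is a fixed random linear map (in place of the learned generative model), the task return is a smooth quadratic `score` over parameters (in place of Monte Carlo rollouts), and the gradient in $\mathcal{Z}$ is estimated by Gaussian smoothing with a mean baseline. All names are hypothetical; what it shows is that search happens over a handful of latent coordinates rather than the full parameter vector.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_dim, param_dim = 4, 64

# Fixed decoder g: Z -> Theta, standing in for the pretrained manifold.
B = rng.normal(size=(latent_dim, param_dim)) / np.sqrt(latent_dim)
g = lambda z: z @ B

# Hypothetical smooth task score over policy parameters; in practice this
# would be the return of environment rollouts with the policy g(z).
theta_star = rng.normal(size=param_dim)
score = lambda theta: -np.sum((theta - theta_star) ** 2)

z = np.zeros(latent_dim)
sigma, lr, n_samples = 0.1, 0.02, 32
initial = score(g(z))
for _ in range(200):
    # Gaussian-smoothed gradient estimate in the latent space, with the
    # batch mean subtracted as a variance-reducing baseline.
    eps = rng.normal(size=(n_samples, latent_dim))
    rewards = np.array([score(g(z + sigma * e)) for e in eps])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = (rewards[:, None] * eps).mean(axis=0) / sigma
    z += lr * grad  # ascend in the 4-dim latent space, not the 64-dim Theta
final = score(g(z))
print(f"score: {initial:.1f} -> {final:.1f}")
```

Only 4 coordinates are updated per step, yet the score improves toward the best policy reachable through the decoder's range, which is the sample-efficiency argument for adapting in $\mathcal{Z}$.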