Unified Policy Value Decomposition for Rapid Adaptation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of rapid adaptation to new goals in complex reinforcement-learning control tasks by proposing a bilinear actor-critic framework grounded in shared low-dimensional goal embeddings. The policy and value function are jointly modeled as bilinear forms composed of shared basis functions and goal-specific coefficients. Inspired by neural gain-modulation mechanisms, the approach enables zero-shot transfer without retraining. Evaluated on the MuJoCo Ant multi-directional locomotion task, the model generalizes to unseen movement directions: specialized policy basis heads cover subsets of directions, while interpolation in the goal-embedding space provides immediate adaptation in high-dimensional control.

📝 Abstract
Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector, a goal embedding, that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) are goal-conditioned coefficients and y_k(s,a) are learned value basis functions. This multiplicative gating, in which a context signal scales a set of state-dependent bases, is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
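The abstract's bilinear critic, Q = sum_k G_k(g) y_k(s,a) with bases frozen at test time, can be illustrated with a minimal numpy sketch. All shapes, the tanh random features, and the linear goal encoder below are illustrative assumptions, not the paper's actual architecture:

```python
# Minimal sketch of the bilinear critic decomposition from the abstract:
#   Q(s, a, g) = sum_k G_k(g) * y_k(s, a)
# The bases y_k and goal encoder weights stand in for pretrained networks
# and stay frozen; a novel goal only changes the coefficients G_k(g).
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, GOAL_DIM, K = 8, 2, 2, 4  # K = number of value bases

# Frozen "pretrained" parameters (random stand-ins for learned weights).
W_basis = rng.normal(size=(K, STATE_DIM + ACTION_DIM))  # value bases y_k(s, a)
W_goal = rng.normal(size=(K, GOAL_DIM))                 # goal encoder G(g)

def value_bases(s, a):
    """y_k(s, a): K state-action basis functions (tanh features here)."""
    return np.tanh(W_basis @ np.concatenate([s, a]))

def goal_coefficients(g):
    """G(g): goal-conditioned coefficients from a single forward pass."""
    return W_goal @ g

def q_value(s, a, g):
    """Bilinear critic: inner product of goal coefficients and value bases."""
    return goal_coefficients(g) @ value_bases(s, a)

s = rng.normal(size=STATE_DIM)
a = rng.normal(size=ACTION_DIM)

# Zero-shot adaptation: an unseen direction reuses the frozen bases and
# simply produces new coefficients via one forward pass of the goal encoder.
g_seen = np.array([1.0, 0.0])                               # e.g. walk east
g_novel = np.array([np.cos(np.pi / 8), np.sin(np.pi / 8)])  # unseen direction
print(q_value(s, a, g_seen), q_value(s, a, g_novel))
```

The same coefficient-weighted composition is applied to the actor in the paper, with G_k(g) weighting a set of primitive policy heads instead of value bases.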
Problem

Research questions and friction points this paper is trying to address.

rapid adaptation
reinforcement learning
complex control systems
task generalization
zero-shot transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

bilinear decomposition
goal embedding
rapid adaptation
successor features
gain modulation
Cristiano Capone
Computational Neuroscience Unit, Istituto Superiore di Sanità, 00161, Rome, Italy
Luca Falorsi
Unknown affiliation
Andrea Ciardiello
Computational Neuroscience Unit, Istituto Superiore di Sanità, 00161, Rome, Italy
Luca Manneschi
School of Computer Science, University of Sheffield, Sheffield, S10 2TN, United Kingdom