🤖 AI Summary
This work addresses the challenge of inefficient exploration in high-dimensional continuous control, where conventional strategies suffer from rapidly diminishing effectiveness as action space dimensionality increases. To overcome this limitation, the authors propose Qflex, a novel method that introduces value-guided probability flows directly in the native high-dimensional action space, replacing isotropic noise with directed exploration. By integrating a learnable source distribution with gradients of the value function, Qflex enables efficient and scalable exploration within an online reinforcement learning framework, circumventing the representational losses associated with dimensionality reduction. Empirical results demonstrate that Qflex significantly outperforms existing approaches across multiple high-dimensional continuous control benchmarks and successfully enables a full-body musculoskeletal model to execute complex, agile motor tasks.
📝 Abstract
Controlling high-dimensional systems in biological and robotic applications is challenging due to expansive state-action spaces, where effective exploration is critical. Commonly used exploration strategies in reinforcement learning are largely undirected and degrade sharply as action dimensionality grows. Many existing methods resort to dimensionality reduction, which constrains policy expressiveness and forfeits system flexibility. We introduce Q-guided Flow Exploration (Qflex), a scalable reinforcement learning method that conducts exploration directly in the native high-dimensional action space. During training, Qflex transports actions drawn from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise. Our proposed method substantially outperforms representative online reinforcement learning baselines across diverse high-dimensional continuous-control benchmarks. Qflex also successfully controls a full-body human musculoskeletal model to perform agile, complex movements, demonstrating superior scalability and sample efficiency in very high-dimensional settings. Our results indicate that value-guided flows offer a principled and practical route to exploration at scale.
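The core idea described above can be illustrated with a minimal sketch: sample an action from a source distribution, then move it along the gradient of a value function instead of perturbing it with isotropic noise. All names here (`q_value`, `q_grad`, `qflex_explore`) and the toy quadratic Q-function are illustrative assumptions, not the paper's actual implementation, which learns both the source distribution and the flow.

```python
import numpy as np

def q_value(state, action):
    # Toy stand-in for a learned Q-function, peaked at action = tanh(state).
    # In the real method this would be a learned critic network.
    target = np.tanh(state)
    return -np.sum((action - target) ** 2)

def q_grad(state, action):
    # Analytic gradient of the toy Q with respect to the action.
    # A learned critic would supply this via automatic differentiation.
    return -2.0 * (action - np.tanh(state))

def qflex_explore(state, dim, steps=50, step_size=0.1, rng=None):
    """Value-guided exploration sketch: draw an action from a (here fixed
    Gaussian) source distribution, then transport it along the gradient of Q,
    a crude stand-in for the paper's value-induced probability flow."""
    rng = np.random.default_rng(0) if rng is None else rng
    action = rng.normal(size=dim)  # sample from the source distribution
    for _ in range(steps):
        # Follow the value gradient in the native action space.
        action = action + step_size * q_grad(state, action)
    return action
```

Even in this toy setting, the gradient flow concentrates sampled actions around the value maximum regardless of action dimensionality, which is the intuition behind directed exploration at scale.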