AI Summary
Existing quality-diversity reinforcement learning methods struggle to balance performance and behavioral diversity due to insufficient exploration of policy branching structures. This work proposes SV-QD-RL, a novel framework that couples policy architecture with value learning for the first time. By employing structural masks to define learning subspaces, it constructs structure-conditioned Actor-Critic branches, each equipped with a dedicated critic and a replay-state mechanism to jointly promote behavioral specialization and high-quality policy generation. Integrated with a behavior-value joint archive management strategy, the method efficiently constructs a diverse and high-performing policy repertoire on MuJoCo tasks. Ablation studies confirm the complementary contributions of individual components and demonstrate the framework's capability to retrieve policies tailored to specific behavioral requirements on demand.
Abstract
Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evaluation or use learned value information to improve policy quality and behavior targeting, while the learning branches that generate candidate policies remain less explored. This paper proposes SV-QD-RL, a structure-value coupled framework that represents each candidate as a structure-conditioned actor-critic branch. Each branch contains an actor, a structural mask, a branch-specific critic, a replay state, and evaluation attributes including behavior, return, sparsity, and value profile. The structural mask defines the actor subspace in which the branch learns, while the branch-specific critic and replay state shape its value-learning trajectory. A branch-aware QD archive then evaluates and retains branches according to behavioral quality, structural footprint, and value-profile information. Experiments on MuJoCo continuous-control tasks show that SV-QD-RL constructs policy repertoires with strong archive quality and behaviorally useful diversity. Ablation and diagnostic analyses further indicate that structural conditioning, critic differentiation, and memory-consistent refinement make complementary contributions to behavioral specialization. Schedule-aware repertoire evaluation shows that the learned archive provides selectable policy alternatives under changing behavior-level requirements. These results suggest that coupling actor structure with branch-specific value learning is an effective mechanism for generating diverse QD-RL policy repertoires.
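To make the branch description above concrete, the following is a minimal sketch of how a structure-conditioned branch and a behavior-indexed archive might be represented. It is not the authors' implementation. The names `Branch`, `GridArchive`, `masked_weights`, and `try_insert` are hypothetical, the actor and critic are reduced to plain stand-in fields, and the archive uses a simple MAP-Elites-style rule (keep the higher-return branch per behavior cell) as an assumption; the abstract states that the actual archive also considers structural footprint and value-profile information, which this sketch omits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Branch:
    """Hypothetical record for one structure-conditioned actor-critic branch."""
    actor_weights: List[float]        # flattened actor parameters (stand-in)
    mask: List[int]                   # structural mask: 1 = active, 0 = masked out
    critic_id: int                    # handle to a branch-specific critic (stand-in)
    replay: List[tuple] = field(default_factory=list)  # branch-local replay state
    behavior: Tuple[float, ...] = ()  # behavior descriptor from evaluation
    ret: float = float("-inf")        # return from evaluation

    def masked_weights(self) -> List[float]:
        # The mask restricts the actor to a subspace: masked-out weights become zero.
        return [w * m for w, m in zip(self.actor_weights, self.mask)]

    def sparsity(self) -> float:
        # Fraction of actor parameters removed by the mask.
        return 1.0 - sum(self.mask) / len(self.mask)


class GridArchive:
    """MAP-Elites-style grid archive (an assumed rule, not the paper's)."""

    def __init__(self, bins: int, low: float, high: float):
        self.bins, self.low, self.high = bins, low, high
        self.cells: Dict[Tuple[int, ...], Branch] = {}

    def cell_of(self, behavior: Tuple[float, ...]) -> Tuple[int, ...]:
        # Discretize each behavior dimension into one of `bins` intervals,
        # clamping values that fall outside [low, high].
        idx = []
        for b in behavior:
            frac = (b - self.low) / (self.high - self.low)
            idx.append(min(self.bins - 1, max(0, int(frac * self.bins))))
        return tuple(idx)

    def try_insert(self, branch: Branch) -> bool:
        # Keep the branch if its cell is empty or it beats the incumbent's return.
        key = self.cell_of(branch.behavior)
        incumbent = self.cells.get(key)
        if incumbent is None or branch.ret > incumbent.ret:
            self.cells[key] = branch
            return True
        return False
```

In this sketch, selecting a policy for a given behavior-level requirement amounts to looking up the cell returned by `cell_of` for the target descriptor. A fuller version would replace the return-only comparison in `try_insert` with whatever joint criterion the archive actually uses.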