🤖 AI Summary
This work addresses a training-scalability problem in inductive generalization for reinforcement learning: as the number of training tasks grows, the aggregated reward signals become noisy and conflicting, which destabilizes training and degrades generalization. To mitigate this, the authors propose DIBS, an approach that, for the first time, brings behavioral cloning into the inductive generalization framework. DIBS first trains a teacher policy for each individual task with standard reinforcement learning, then uses the teacher-labeled state-action pairs to learn a higher-order policy-evolution function via behavioral cloning. This decoupled design avoids the instability of end-to-end reinforcement learning by supplying dense, stable supervision derived from the teacher policies. Empirically, DIBS significantly outperforms existing reinforcement learning and meta-reinforcement learning methods across multiple tasks, with more stable training and better zero-shot generalization.
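The two-stage pipeline (RL teachers first, behavioral cloning second) can be sketched on a toy task family. Everything concrete below is an assumption for illustration, not the paper's setup: task n is a 12-cell corridor whose goal is cell n; exact tabular Q-value iteration on the known deterministic model stands in for "standard RL" so the example is reproducible; and the evolution function is replaced by a hypothetical task-conditioned logistic-regression policy over the hand-picked feature n - s. The names `train_teacher`, `collect_pairs`, and `fit_bc` are illustrative only.

```python
import math

WIDTH = 12          # hypothetical toy family: corridor cells 0..WIDTH-1
LEFT, RIGHT = 0, 1  # action indices


def next_state(s, a):
    """Deterministic move; walls clamp the agent to [0, WIDTH - 1]."""
    return max(0, min(WIDTH - 1, s + (1 if a == RIGHT else -1)))


def train_teacher(goal, gamma=0.9, sweeps=2 * WIDTH):
    """Stage 1 stand-in for 'standard RL': tabular Q-value iteration on task n = goal.
    Entering the goal cell gives reward 1 and ends the episode."""
    q = [[0.0, 0.0] for _ in range(WIDTH)]
    for _ in range(sweeps):
        new_q = [[0.0, 0.0] for _ in range(WIDTH)]
        for s in range(WIDTH):
            if s == goal:
                continue  # terminal cell: nothing to value
            for a in (LEFT, RIGHT):
                s2 = next_state(s, a)
                new_q[s][a] = 1.0 if s2 == goal else gamma * max(q[s2])
        q = new_q
    # Greedy teacher policy over the converged Q-table.
    return lambda s: RIGHT if q[s][RIGHT] > q[s][LEFT] else LEFT


def collect_pairs(tasks):
    """Teacher-labeled (state, task index, action) triples; goal cells are terminal and skipped."""
    data = []
    for n in tasks:
        teacher = train_teacher(n)
        data.extend((s, n, teacher(s)) for s in range(WIDTH) if s != n)
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_bc(data, lr=0.1, iters=3000):
    """Stage 2: behavioral cloning as logistic regression on the teacher labels.
    Hypothetical parameterization: P(RIGHT | s, n) = sigmoid(w * (n - s) + b)."""
    w = b = 0.0
    for _ in range(iters):
        gw = gb = 0.0
        for s, n, a in data:
            err = sigmoid(w * (n - s) + b) - (1.0 if a == RIGHT else 0.0)
            gw += err * (n - s)  # gradient of the summed cross-entropy w.r.t. w
            gb += err            # ... and w.r.t. b
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return lambda s, n: RIGHT if w * (n - s) + b > 0 else LEFT


# Usage: clone teachers for goals 2..6, then check zero-shot on an unseen goal.
policy = fit_bc(collect_pairs([2, 3, 4, 5, 6]))
unseen = 9
teacher = train_teacher(unseen)
agree = all(policy(s, unseen) == teacher(s) for s in range(WIDTH) if s != unseen)
print("zero-shot agreement on task 9:", agree)
```

The decoupling shows up in `fit_bc`: every training example carries its own teacher label, so the gradient is never estimated from returns pooled across tasks. Because this toy family is linear in n - s, the cloned policy also matches the teacher on goals outside the training set; real task families will not be this forgiving.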
📝 Abstract
Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as the number of training tasks grows, the aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn an individual teacher policy for each task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization over existing RL and meta-RL algorithms.
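One generic way to write the contrast between "reward aggregation" and "dense supervision" is shown below. The notation is introduced here as an assumption, since the abstract does not give the paper's exact objectives: $\mathcal{T}$ is the set of training tasks, $r_n$ the reward of task $n$, $\gamma$ the discount, $H$ the horizon, $\pi_n^{\ast}$ the RL-trained teacher for task $n$, and $\pi_\theta^{(n)}$ the policy that the evolution function with parameters $\theta$ produces for task $n$, without committing to how that function is parameterized.

```latex
% End-to-end RL (schematic): one scalar return per task, summed over all training tasks
\max_{\theta}\; J_{\mathrm{RL}}(\theta)
  = \sum_{n \in \mathcal{T}} \mathbb{E}_{\tau \sim \pi_\theta^{(n)}}
    \Big[ \sum_{t=0}^{H-1} \gamma^{t}\, r_n(s_t, a_t) \Big]

% Decoupled behavioral cloning (schematic): one teacher action label per visited state
\mathcal{D}_n = \{ (s, \pi_n^{\ast}(s)) : s \text{ visited by } \pi_n^{\ast} \}, \qquad
\min_{\theta}\; \mathcal{L}_{\mathrm{BC}}(\theta)
  = -\sum_{n \in \mathcal{T}} \frac{1}{|\mathcal{D}_n|}
    \sum_{(s, a) \in \mathcal{D}_n} \log \pi_\theta^{(n)}(a \mid s)
```

In the first objective, $\theta$ receives feedback only through trajectory returns that are added up across tasks, so tasks that pull $\theta$ in different directions can partly cancel. In the second, each state in $\mathcal{D}_n$ supplies its own target action, and the teachers have already solved the exploration problem for their individual tasks.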