APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of reinforcement learning methods that incorporate demonstration data: they typically assume the demonstrations are optimal and perfectly aligned with the target task, an assumption often violated in practice, where sparse, suboptimal, or misaligned demonstrations degrade performance. To overcome this, the paper proposes the Adaptive Policy Composition (APC) framework, which combines a hierarchical reinforcement learning architecture with normalizing flows to model behavioral priors from multiple sources. APC dynamically evaluates and weights these priors, adaptively drawing on heterogeneous demonstrations without rigid reliance on any single prior, which accelerates learning while preserving robustness. Experiments show that APC-RL substantially improves sample efficiency when demonstrations are aligned and maintains strong performance even when they are severely misaligned or suboptimal, demonstrating robust exploration.
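The summary above only sketches the composition mechanism. As a rough illustration in our own notation (an assumption, not the paper's exact formulation), a softmax-weighted composition of $K$ flow priors could take the form

$$\pi_{\mathrm{comp}}(a \mid s) = \sum_{k=1}^{K} w_k(s)\, p_k(a \mid s), \qquad w(s) = \mathrm{softmax}\!\big(f_\theta(s)\big),$$

where each $p_k$ is a normalizing-flow prior fit to one demonstration source, $w_k(s)$ is its estimated applicability to the target task, and $f_\theta$ is a hypothetical learned scoring function.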

📝 Abstract
Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when they are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding the performance degradation that comes from adhering to them too strictly.
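To make the composition idea concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation. It stands in for trained normalizing flows with affine-transformed Gaussians, and the learnable `logits` vector for per-prior applicability weights is our own invention; the sketch only shows sampling and log-density evaluation under the softmax-weighted mixture.

```python
import torch
import torch.nn as nn
from torch.distributions import (AffineTransform, Categorical, Independent,
                                 Normal, TransformedDistribution)

# Stand-ins for trained normalizing-flow priors over a 2-D action space:
# affine-transformed Gaussians (a real NF would use richer transforms).
def make_prior(shift, scale):
    base = Independent(Normal(torch.zeros(2), torch.ones(2)), 1)
    return TransformedDistribution(base, [AffineTransform(shift, scale)])

priors = [
    make_prior(torch.tensor([1.0, 0.0]), 0.5),   # e.g., prior from source A
    make_prior(torch.tensor([-1.0, 1.0]), 0.3),  # e.g., prior from source B
]

# Hypothetical applicability logits, one per prior; in APC these would be
# adapted from downstream reward (details are in the paper, not shown here).
logits = nn.Parameter(torch.zeros(len(priors)))

def composed_log_prob(action):
    """Log-density of the action under the softmax-weighted mixture."""
    log_w = torch.log_softmax(logits, dim=-1)                  # (K,)
    log_p = torch.stack([p.log_prob(action) for p in priors])  # (K,)
    return torch.logsumexp(log_w + log_p, dim=0)               # scalar

def sample_action():
    """Pick a prior by its weight, then sample an action from it."""
    with torch.no_grad():
        k = Categorical(logits=logits).sample().item()
    return priors[k].sample()

a = sample_action()
print(a, composed_log_prob(a).item())
```

In APC the weights would additionally be adapted during RL, so misaligned priors are down-weighted rather than followed; the sketch shows only the static composition.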
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
demonstration data
suboptimal demonstrations
misaligned priors
behavior priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Policy Composition
Normalizing Flow
Reinforcement Learning
Demonstration Priors
Hierarchical Policy
👥 Authors
Finn Rietz
Örebro University
Deep Learning · Reinforcement Learning · Robotics
Pedro Zuidberg Dos Martires
Örebro University
J. A. Stork
Department of Computer Science, Örebro University, Fakultetsgatan 1, Örebro, Sweden