APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of reinforcement learning methods that incorporate demonstration data: they typically assume the demonstrations are optimal and perfectly aligned with the target task, an assumption often violated in practice, where sparse, suboptimal, or misaligned demonstrations degrade performance. To overcome this, the paper proposes the Adaptive Policy Composition (APC) framework, which combines a hierarchical reinforcement learning architecture with normalizing flows to model behavioral priors from multiple sources. APC dynamically evaluates and weights these priors, adaptively drawing on heterogeneous demonstrations without rigid reliance on any single prior, which accelerates learning while preserving robustness. Experiments show that APC-RL substantially improves sample efficiency when demonstrations are aligned and maintains strong performance even when they are severely misaligned or suboptimal, demonstrating robust exploration.
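The summary above only sketches the composition mechanism. As a rough illustration in our own notation (an assumption, not the paper's exact formulation), a softmax-weighted composition of $K$ flow priors could take the form

$$\pi_{\mathrm{comp}}(a \mid s) = \sum_{k=1}^{K} w_k(s)\, p_k(a \mid s), \qquad w(s) = \mathrm{softmax}\!\big(f_\theta(s)\big),$$

where each $p_k$ is a normalizing-flow prior fit to one demonstration source, $w_k(s)$ is its estimated applicability to the target task, and $f_\theta$ is a hypothetical learned scoring function.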

📝 Abstract
Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when they are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior's applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding the performance degradation that comes from adhering to them too strictly.
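To make the composition idea concrete, here is a minimal, hypothetical PyTorch sketch, not the authors' implementation. It stands in for trained normalizing flows with affine-transformed Gaussians, and the learnable `logits` vector for per-prior applicability weights is our own invention; the sketch only shows sampling and log-density evaluation under the softmax-weighted mixture.

```python
import torch
import torch.nn as nn
from torch.distributions import (AffineTransform, Categorical, Independent,
                                 Normal, TransformedDistribution)

# Stand-ins for trained normalizing-flow priors over a 2-D action space:
# affine-transformed Gaussians (a real NF would use richer transforms).
def make_prior(shift, scale):
    base = Independent(Normal(torch.zeros(2), torch.ones(2)), 1)
    return TransformedDistribution(base, [AffineTransform(shift, scale)])

priors = [
    make_prior(torch.tensor([1.0, 0.0]), 0.5),   # e.g., prior from source A
    make_prior(torch.tensor([-1.0, 1.0]), 0.3),  # e.g., prior from source B
]

# Hypothetical applicability logits, one per prior; in APC these would be
# adapted from downstream reward (details are in the paper, not shown here).
logits = nn.Parameter(torch.zeros(len(priors)))

def composed_log_prob(action):
    """Log-density of the action under the softmax-weighted mixture."""
    log_w = torch.log_softmax(logits, dim=-1)                  # (K,)
    log_p = torch.stack([p.log_prob(action) for p in priors])  # (K,)
    return torch.logsumexp(log_w + log_p, dim=0)               # scalar

def sample_action():
    """Pick a prior by its weight, then sample an action from it."""
    with torch.no_grad():
        k = Categorical(logits=logits).sample().item()
    return priors[k].sample()

a = sample_action()
print(a, composed_log_prob(a).item())
```

In APC the weights would additionally be adapted during RL, so misaligned priors are down-weighted rather than followed; the sketch shows only the static composition.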
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
demonstration data
suboptimal demonstrations
misaligned priors
behavior priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Policy Composition
Normalizing Flow
Reinforcement Learning
Demonstration Priors
Hierarchical Policy
👥 Authors
Finn Rietz
Örebro University
Deep Learning · Reinforcement Learning · Robotics
Pedro Zuidberg Dos Martires
Örebro University
J. A. Stork
Department of Computer Science, Örebro University, Fakultetsgatan 1, Örebro, Sweden