Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

📅 2026-06-01

🤖 AI Summary
This work addresses the low sample efficiency and performance degradation commonly encountered when fine-tuning large behavior models with conventional reinforcement learning on sparse-reward dexterous manipulation tasks. The authors propose a coherent off-policy improvement method that combines inverse reinforcement learning with behavioral cloning: a dense reward function is learned from expert demonstrations, and the pretrained policy is then improved within a theoretically grounded imitation learning framework in which that policy is already optimal for the initial reward and critic. Evaluated on six sparse-reward tasks, the approach maintains or improves performance on all of them, reaches a success rate of at least 90% on five of the six, and outperforms reinforcement learning baselines that use sparse rewards.
📝 Abstract
Distilling expert demonstration data into large generative models using behavioral cloning (BC) is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning (IRL), where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy by using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring that the pretrained policy we start finetuning from is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.
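The claim that the pretrained policy can be optimal for the initial reward is easiest to see in a toy, single-state setting. The sketch below is illustrative only: the log-ratio reward form, the temperature `alpha`, the uniform prior, and the function names are all assumptions made for this example, and the abstract does not state the exact formulation the authors use. It shows that if the reward is defined as a scaled log-ratio between a BC policy and a prior, then the KL-regularized optimal policy for that reward is the BC policy itself, so improvement can start from the BC policy rather than from a worse one.

```python
import math

def coherent_reward(bc_probs, prior_probs, alpha):
    """Assumed log-ratio reward for one state: r(a) = alpha * (log q(a) - log p(a)).

    q is the BC policy, p is a prior policy, alpha is a temperature.
    """
    return [alpha * (math.log(q) - math.log(p))
            for q, p in zip(bc_probs, prior_probs)]

def kl_regularized_optimum(rewards, prior_probs, alpha):
    """Policy maximizing E_pi[r] - alpha * KL(pi || prior) in a single state.

    The closed-form solution is pi(a) proportional to p(a) * exp(r(a) / alpha).
    """
    weights = [p * math.exp(r / alpha) for r, p in zip(rewards, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical three-action example.
bc_policy = [0.7, 0.2, 0.1]          # stands in for the pretrained BC policy
prior = [1 / 3, 1 / 3, 1 / 3]        # assumed uniform prior
alpha = 0.5                          # assumed temperature

rewards = coherent_reward(bc_policy, prior, alpha)
recovered = kl_regularized_optimum(rewards, prior, alpha)

# The optimum of the regularized objective matches the BC policy
# up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(recovered, bc_policy))
print(max_gap < 1e-9)
```

In a sequential task the same argument would also involve a critic and would use additional experience to refine the reward; this toy case only covers the initialization property that the abstract describes.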
Problem

Research questions and friction points this paper is trying to address.

off-policy improvement
large behavior models
sparse reward
inverse reinforcement learning
behavioral cloning
Innovation

Methods, ideas, or system contributions that make the work stand out.

inverse reinforcement learning
coherent imitation learning
behavioral cloning
dense reward learning
off-policy improvement