🤖 AI Summary
Generative Adversarial Imitation Learning (GAIL) suffers from low sample efficiency and training instability due to its reliance on on-policy policy gradient methods (e.g., TRPO). Method: This paper proposes an off-policy adversarial imitation learning framework that integrates off-policy experience replay, twin Q-network stabilization, and value function learning—eliminating dependence on explicit reward functions and synchronized policy updates. Built upon a deep Actor-Critic architecture, it enables efficient policy learning without environmental reward signals. Contribution/Results: Experiments on multiple benchmark tasks demonstrate significant improvements in sample efficiency, with convergence accelerated by 3–5× compared to baseline methods. The framework also exhibits enhanced training stability and achieves expert-level performance using substantially fewer expert demonstrations.
📝 Abstract
Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL) (Ho et al., 2016), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al., 2015). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques, specifically double Q-network based stabilization and value learning without reward function inference, we demonstrate a reduction in the number of samples required to robustly match expert behavior.
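The recipe the abstract describes (a discriminator standing in for the environment reward, a replay buffer of off-policy transitions, and twin Q critics whose minimum forms the bootstrap target) can be sketched in a few lines. The toy below is only illustrative: it uses linear function approximators and synthetic (state, action) feature vectors in place of the paper's deep actor-critic networks, and every name, shape, and learning rate is an assumption, not the authors' implementation.

```python
import numpy as np

# Illustrative numpy sketch of off-policy adversarial imitation learning.
# Assumptions: linear discriminator/critics, synthetic (s, a) features;
# the actual method uses deep networks and a learned policy.

rng = np.random.default_rng(0)
feat_dim = 4  # concatenated state-action feature size (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Discriminator D(s, a): logistic regression on (s, a) features.
w_d = np.zeros(feat_dim)

def disc_update(expert_sa, policy_sa, lr=0.1):
    """One gradient ascent step: expert pairs labeled 1, policy pairs 0."""
    global w_d
    x = np.vstack([expert_sa, policy_sa])
    y = np.concatenate([np.ones(len(expert_sa)), np.zeros(len(policy_sa))])
    p = sigmoid(x @ w_d)
    w_d += lr * x.T @ (y - p) / len(y)

def pseudo_reward(sa):
    """GAIL-style surrogate reward; no environment reward signal is used."""
    return np.log(sigmoid(sa @ w_d) + 1e-8)

# Twin Q critics; the target takes min(Q1, Q2) to curb overestimation,
# as in double/clipped-double Q-learning.
w_q1 = rng.normal(size=feat_dim) * 0.01
w_q2 = rng.normal(size=feat_dim) * 0.01

def critic_update(sa, next_sa, gamma=0.99, lr=0.05):
    global w_q1, w_q2
    target = pseudo_reward(sa) + gamma * np.minimum(next_sa @ w_q1,
                                                    next_sa @ w_q2)
    for w in (w_q1, w_q2):
        td = target - sa @ w
        w += lr * sa.T @ td / len(sa)

# Replay buffer: transitions are reused across many updates, which is the
# source of the sample-efficiency gain over on-policy GAIL/TRPO.
expert_sa = rng.normal(loc=1.0, size=(64, feat_dim))   # expert demos
buffer_sa = rng.normal(loc=0.0, size=(256, feat_dim))  # policy rollouts
buffer_next = rng.normal(loc=0.0, size=(256, feat_dim))

for step in range(200):
    idx = rng.integers(0, len(buffer_sa), size=32)   # sample a minibatch
    disc_update(expert_sa, buffer_sa[idx])
    critic_update(buffer_sa[idx], buffer_next[idx])

# After training, the discriminator scores expert pairs higher, so the
# pseudo-reward steers the (omitted) actor toward expert behavior.
print(sigmoid(expert_sa @ w_d).mean() > sigmoid(buffer_sa @ w_d).mean())
```

Note the design choice mirrored from the abstract: the actor (omitted here) would be trained to maximize the twin-Q value, so the policy never sees a hand-engineered reward, only the discriminator's pseudo-reward replayed from the buffer.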