🤖 AI Summary
Generative Adversarial Imitation Learning (GAIL) suffers from low sample efficiency and training instability due to its reliance on on-policy policy gradient methods (e.g., TRPO). Method: This paper proposes an off-policy adversarial imitation learning framework that integrates off-policy experience replay, twin Q-network stabilization, and value function learning—eliminating dependence on explicit reward functions and synchronized policy updates. Built upon a deep Actor-Critic architecture, it enables efficient policy learning without environmental reward signals. Contribution/Results: Experiments on multiple benchmark tasks demonstrate significant improvements in sample efficiency, with convergence accelerated by 3–5× compared to baseline methods. The framework also exhibits enhanced training stability and achieves expert-level performance using substantially fewer expert demonstrations.
📝 Abstract
Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL) (Ho et al., 2016), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al., 2015). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques, specifically double Q-network based stabilization and value learning without reward function inference, we demonstrate a reduction in the number of samples required to robustly match expert behavior.
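The recipe the abstract describes (a discriminator standing in for the environment reward, a replay buffer of off-policy transitions, and twin Q critics whose minimum forms the bootstrap target) can be sketched in a few lines. The toy below is only illustrative: it uses linear function approximators and synthetic (state, action) feature vectors in place of the paper's deep actor-critic networks, and every name, shape, and learning rate is an assumption, not the authors' implementation.

```python
import numpy as np

# Illustrative numpy sketch of off-policy adversarial imitation learning.
# Assumptions: linear discriminator/critics, synthetic (s, a) features;
# the actual method uses deep networks and a learned policy.

rng = np.random.default_rng(0)
feat_dim = 4  # concatenated state-action feature size (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Discriminator D(s, a): logistic regression on (s, a) features.
w_d = np.zeros(feat_dim)

def disc_update(expert_sa, policy_sa, lr=0.1):
    """One gradient ascent step: expert pairs labeled 1, policy pairs 0."""
    global w_d
    x = np.vstack([expert_sa, policy_sa])
    y = np.concatenate([np.ones(len(expert_sa)), np.zeros(len(policy_sa))])
    p = sigmoid(x @ w_d)
    w_d += lr * x.T @ (y - p) / len(y)

def pseudo_reward(sa):
    """GAIL-style surrogate reward; no environment reward signal is used."""
    return np.log(sigmoid(sa @ w_d) + 1e-8)

# Twin Q critics; the target takes min(Q1, Q2) to curb overestimation,
# as in double/clipped-double Q-learning.
w_q1 = rng.normal(size=feat_dim) * 0.01
w_q2 = rng.normal(size=feat_dim) * 0.01

def critic_update(sa, next_sa, gamma=0.99, lr=0.05):
    global w_q1, w_q2
    target = pseudo_reward(sa) + gamma * np.minimum(next_sa @ w_q1,
                                                    next_sa @ w_q2)
    for w in (w_q1, w_q2):
        td = target - sa @ w
        w += lr * sa.T @ td / len(sa)

# Replay buffer: transitions are reused across many updates, which is the
# source of the sample-efficiency gain over on-policy GAIL/TRPO.
expert_sa = rng.normal(loc=1.0, size=(64, feat_dim))   # expert demos
buffer_sa = rng.normal(loc=0.0, size=(256, feat_dim))  # policy rollouts
buffer_next = rng.normal(loc=0.0, size=(256, feat_dim))

for step in range(200):
    idx = rng.integers(0, len(buffer_sa), size=32)   # sample a minibatch
    disc_update(expert_sa, buffer_sa[idx])
    critic_update(buffer_sa[idx], buffer_next[idx])

# After training, the discriminator scores expert pairs higher, so the
# pseudo-reward steers the (omitted) actor toward expert behavior.
print(sigmoid(expert_sa @ w_d).mean() > sigmoid(buffer_sa @ w_d).mean())
```

Note the design choice mirrored from the abstract: the actor (omitted here) would be trained to maximize the twin-Q value, so the policy never sees a hand-engineered reward, only the discriminator's pseudo-reward replayed from the buffer.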