Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Generative Adversarial Imitation Learning (GAIL) suffers from low sample efficiency and training instability because it relies on on-policy policy-gradient methods (e.g., TRPO). Method: This paper proposes an off-policy adversarial imitation learning framework that integrates off-policy experience replay, twin Q-network stabilization, and value-function learning, removing the need for explicit reward functions and synchronized on-policy updates. Built on a deep actor-critic architecture, it learns policies efficiently without environmental reward signals. Contribution/Results: Experiments on multiple benchmark tasks show significant gains in sample efficiency, with convergence 3–5× faster than baseline methods; the framework also trains more stably and reaches expert-level performance with substantially fewer expert demonstrations.
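
The paper does not appear to ship code, so the following is a minimal PyTorch sketch of the core off-policy ingredient described above: a GAIL-style discriminator supplies a surrogate reward, and transitions are kept in a replay buffer and relabeled with the current discriminator when sampled. All dimensions, network sizes, and names are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions; not taken from the paper.
obs_dim, act_dim = 8, 2

# Discriminator D(s, a), trained to separate expert pairs from policy pairs.
disc = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

def surrogate_reward(state, action):
    """GAIL-style reward -log(1 - D(s, a)); no environment reward is used."""
    with torch.no_grad():
        logits = disc(torch.cat([state, action], dim=-1))
        # -log(1 - sigmoid(x)) == -logsigmoid(-x), in numerically stable form.
        return -F.logsigmoid(-logits)

# Off-policy ingredient: transitions are replayed many times instead of being
# discarded after a single on-policy update.
replay = deque(maxlen=100_000)

def store(state, action, next_state, done):
    # `done` is stored as a float tensor so it can scale the bootstrap term.
    replay.append((state, action, next_state, done))

def sample_batch(batch_size=256):
    s, a, s2, d = map(torch.stack, zip(*random.sample(replay, batch_size)))
    # Relabel with the current discriminator so surrogate rewards never go stale.
    return s, a, surrogate_reward(s, a), s2, d
```

Relabeling at sample time, rather than storing rewards at insertion, keeps the surrogate rewards consistent with the discriminator as it improves.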

📝 Abstract
Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al., 2015). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques, specifically double Q-network stabilization and value learning without reward-function inference, we demonstrate a reduction in the samples required to robustly match expert behavior.
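
To make the adversarial half concrete, here is a hedged sketch of one discriminator update under the same assumptions as the sketch above: expert state-action pairs are labeled 1, replayed policy pairs 0, and a standard binary cross-entropy loss is minimized.

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, disc_opt, expert_sa, policy_sa):
    """One adversarial update: expert (s, a) pairs are labeled 1, policy pairs 0.

    `expert_sa` and `policy_sa` are (batch, obs_dim + act_dim) tensors; the
    discriminator and its optimizer are assumed to be built by the caller.
    """
    expert_logits = disc(expert_sa)
    policy_logits = disc(policy_sa)
    loss = (
        F.binary_cross_entropy_with_logits(
            expert_logits, torch.ones_like(expert_logits))
        + F.binary_cross_entropy_with_logits(
            policy_logits, torch.zeros_like(policy_logits))
    )
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()
    return loss.item()
```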
Problem

Research questions and friction points this paper is trying to address.

Improves sample efficiency in imitation learning using off-policy methods
Stabilizes adversarial imitation learning with deep actor-critic techniques
Reduces dependency on reward engineering through expert demonstration learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses off-policy learning for sample efficiency
Implements double Q-network stabilization to curb value overestimation (see the sketch after this list)
Learns value functions without reward inference
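
Below is a minimal sketch of the double Q-network (clipped double-Q) target referred to above, in the spirit of TD3/SAC; the network names are assumptions, and the reward is the discriminator's surrogate signal rather than an environment reward.

```python
import torch

def twin_q_target(q1_t, q2_t, actor_t, reward, next_state, done, gamma=0.99):
    """Clipped double-Q bootstrap target computed from two target critics.

    `q1_t`/`q2_t` map concatenated (s, a) inputs to (batch, 1) values,
    `actor_t` is the target policy, and `done` is a (batch, 1) float mask.
    """
    with torch.no_grad():
        next_action = actor_t(next_state)
        x = torch.cat([next_state, next_action], dim=-1)
        # Taking the minimum over two critics curbs Q-value overestimation.
        next_q = torch.min(q1_t(x), q2_t(x))
        # `reward` here is the discriminator-derived surrogate reward.
        return reward + gamma * (1.0 - done) * next_q
```

Both critics then regress to this shared target (e.g., with an MSE loss); the minimum over the two target critics is what damps the overestimation bias that destabilizes single-critic actor-critic training.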