Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation

📅 2025-06-20

🤖 AI Summary
Existing adversarially robust reinforcement learning (RL) methods struggle with long-horizon robustness under adversarial observations, require many environment interactions, and are difficult to train off-policy. To address these challenges, this paper proposes the first fully off-policy adversarially robust RL framework. The method introduces: (1) a virtual alternative training paradigm that exploits the symmetry of policy evaluation to decouple the agent and the adversary, mitigating their strong interdependence; (2) a symmetric Bellman operator coupled with soft-constrained Lagrangian optimization to enable coordinated updates; and (3) integration with the Soft Actor-Critic (SAC) architecture, with adversarial perturbations generated virtually so that no additional environment sampling is needed. Evaluated across multiple benchmark tasks, the approach significantly improves long-horizon robustness and boosts training efficiency by over 40%. The implementation is publicly available.
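To make the idea of a soft-constrained adversary that needs no extra environment sampling more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the names `toy_policy`, `toy_critic`, `candidate_perturbations`, and `soft_adversary_weights` are hypothetical, the critic is a hand-written stand-in for a learned Q-function, and the KL-to-uniform penalty is just one common way to soften a worst-case choice. The only environment-free ingredients it uses are a stored state and a critic, which is the sense in which the perturbation is "virtual".

```python
import math

def toy_policy(observation):
    # Hypothetical deterministic policy: act on whatever it observes.
    return observation

def toy_critic(true_state, action):
    # Hypothetical stand-in for a learned Q-value; best action equals the true state.
    return -(action - true_state) ** 2

def candidate_perturbations(state, epsilon, k):
    # k evenly spaced perturbed observations inside [state - epsilon, state + epsilon].
    step = 2.0 * epsilon / (k - 1)
    return [state - epsilon + i * step for i in range(k)]

def soft_adversary_weights(q_values, temperature):
    # Closed-form minimizer of  sum_i w_i * q_i + temperature * KL(w || uniform)
    # over probability vectors w, i.e. a softmin over the critic values.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    q_min = min(q_values)  # shift for numerical stability
    unnorm = [math.exp(-(q - q_min) / temperature) for q in q_values]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# A state taken from a replay buffer; no environment step is needed below.
state = 0.5
candidates = candidate_perturbations(state, epsilon=0.2, k=5)
q_values = [toy_critic(state, toy_policy(obs)) for obs in candidates]
weights = soft_adversary_weights(q_values, temperature=0.01)

for obs, q, w in zip(candidates, q_values, weights):
    print(f"perturbed obs {obs:.2f}  Q {q:.3f}  adversary weight {w:.3f}")
```

In this toy setting the adversary concentrates its weight on the two extreme perturbations, which hurt the agent most, and a larger temperature pushes the weights back toward uniform.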

📝 Abstract
Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both training adversaries to minimize the agent's cumulative reward and training agents to counteract them, through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environment interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
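
As a rough sketch of what "reformulating adversarial learning as a soft-constrained optimization problem" can mean, the block below contrasts a hard-constrained worst-case objective with a penalized version. The notation is an assumption made for illustration and does not appear on this page: \nu is the adversary's observation perturbation, \epsilon the perturbation budget, c a nonnegative penalty on perturbation size, and \lambda a positive multiplier. The paper's exact objective may differ.

```latex
\begin{align}
\text{hard constraint:}\quad
  & \max_{\pi}\ \min_{\nu}\ \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\Big],
  \qquad a_t \sim \pi(\cdot \mid \nu(s_t)),\quad \|\nu(s_t) - s_t\| \le \epsilon \\
\text{soft constraint:}\quad
  & \max_{\pi}\ \min_{\nu}\ \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\,\big(r(s_t, a_t) + \lambda\, c(s_t, \nu(s_t))\big)\Big],
  \qquad a_t \sim \pi(\cdot \mid \nu(s_t)),\quad \lambda > 0
\end{align}
```

In the second form the adversary still minimizes the agent's return, but large perturbations are discouraged by the added cost rather than forbidden outright.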
Problem

Research questions and friction points this paper is trying to address.

Enhance adversarial observation robustness in RL
Address mutual dependencies in agent-adversary training
Enable off-policy learning without extra environment interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy adversarial robustness via soft-constrained optimization
Symmetric policy evaluation for agent-adversary training
Virtual alternative training eliminates additional environment interactions
Kosuke Nakanishi
Department of Information Science, Kyoto University, Kyoto, Japan; Honda R&D Co., Ltd., Tokyo, Japan
Akihiro Kubo
Department of Information Science, Kyoto University, Kyoto, Japan; ATR Neural Information Processing Laboratories, Kyoto, Japan; International Research Center for Neurointelligence, The University of Tokyo, Tokyo, Japan
Yuji Yasui
Honda R&D Co., Ltd., Tokyo, Japan
Shin Ishii
Kyoto University