Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation

📅 2025-06-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing reinforcement learning (RL) methods that defend against adversarial observations show poor long-horizon robustness, require excessive environment interaction, and are difficult to train off-policy. To address these challenges, this paper proposes the first fully off-policy adversarially robust RL framework. The method introduces: (1) a virtual alternative training paradigm that exploits the symmetry of policy evaluation to decouple the agent and adversary, mitigating their strong interdependence; (2) a symmetric Bellman operator coupled with soft-constraint Lagrangian optimization to enable coordinated updates; and (3) integration with the Soft Actor-Critic (SAC) architecture and virtual adversarial perturbation generation, eliminating the need for additional environment sampling. Evaluated across multiple benchmark tasks, the approach significantly improves long-horizon robustness and boosts training efficiency by over 40%. The implementation is publicly available.

📝 Abstract

Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires alternating learning: adversaries are trained to minimize the agent's cumulative reward, and agents are trained to counteract them. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
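
To give a feel for what a soft-constrained adversary can look like, here is a minimal sketch. It is a generic illustration and not the paper's algorithm. It assumes a finite set of candidate perturbed observations, each with an agent value in `q_values`, a prior over candidates, and a temperature `lam`; all of these names are hypothetical. A hard worst-case adversary would pick the single lowest value. A KL-regularized adversary instead minimizes E_w[Q] + lam * KL(w || prior), which has the closed-form solution w ∝ prior * exp(-Q / lam).

```python
import numpy as np

def soft_worst_case(q_values, lam, prior=None):
    """Closed-form minimizer of E_w[Q] + lam * KL(w || prior).

    q_values, lam, and prior are illustrative names (assumptions),
    not identifiers from the paper or its repository.
    Returns the adversary weights w and the optimal objective value.
    """
    q = np.asarray(q_values, dtype=float)
    if prior is None:
        # Assumption: uniform prior over the candidate perturbations.
        prior = np.full(q.shape, 1.0 / q.size)
    prior = np.asarray(prior, dtype=float)

    # Unnormalized log-weights of the minimizer: log prior - Q / lam.
    logits = np.log(prior) - q / lam
    shift = logits.max()          # subtract the max for numerical stability
    z = np.exp(logits - shift)
    w = z / z.sum()

    # Optimal value: -lam * log sum_i prior_i * exp(-Q_i / lam).
    value = -lam * (shift + np.log(z.sum()))
    return w, float(value)

# Small lam approaches the hard minimum; large lam approaches the prior mean.
weights, soft_value = soft_worst_case([1.0, 2.0, 3.0], lam=1.0)
```

The temperature interpolates between the two extremes: as `lam` shrinks, the weights concentrate on the lowest-value candidate, and as it grows, the adversary stays close to the prior.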

Problem

Research questions and friction points this paper is trying to address.

Enhance adversarial observation robustness in RL

Address mutual dependencies in agent-adversary training

Enable off-policy learning without extra environment interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy adversarial robustness via soft-constrained optimization

Symmetric policy evaluation for agent-adversary training

Virtual alternative training eliminates additional environment interactions
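
The idea of producing adversarial perturbations without new environment steps can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: `virtual_perturb`, `score_fn`, `eps`, and `n_candidates` are hypothetical names, `score_fn` stands in for whatever value the agent obtains when it acts on a given observation, and plain random search inside an L-infinity ball stands in for the paper's perturbation generator. The point is only that every input comes from already-stored states, so no environment call is needed.

```python
import numpy as np

def virtual_perturb(states, score_fn, eps, n_candidates=16, rng=None):
    """Pick, for each stored state, the lowest-scoring nearby observation.

    Hypothetical helper: random search within an L-infinity ball of
    radius eps around each state. No environment interaction occurs.
    """
    rng = np.random.default_rng() if rng is None else rng
    states = np.asarray(states, dtype=float)
    batch, dim = states.shape

    # Sample candidate perturbations around every stored state.
    noise = rng.uniform(-eps, eps, size=(batch, n_candidates, dim))
    candidates = states[:, None, :] + noise

    # Score all candidates in one call, then restore the batch layout.
    scores = score_fn(candidates.reshape(-1, dim)).reshape(batch, n_candidates)

    # Keep the candidate with the lowest score for each state.
    worst = scores.argmin(axis=1)
    return candidates[np.arange(batch), worst]

# Usage with a toy score: the sum of observation features.
stored = np.zeros((4, 3))  # stands in for a replay-buffer batch
perturbed = virtual_perturb(stored, lambda x: x.sum(axis=1), eps=0.1,
                            rng=np.random.default_rng(0))
```

In an off-policy setup, the perturbed batch would feed the agent and adversary updates in place of fresh rollouts; how the paper couples those updates is not shown here.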
