AI Summary
Offline reinforcement learning (RL) suffers from performance degradation in safety-critical applications when training data is corrupted, e.g., by adversarial poisoning or system failures. To address this, we propose Density-Ratio Weighted Behavior Cloning (DR-BC), a robust offline RL method that leverages a small clean reference dataset to estimate trajectory-level density ratios via a binary discriminator, thereby automatically identifying and down-weighting anomalous samples without requiring prior knowledge of the corruption mechanism. We theoretically establish that DR-BC converges to the optimal policy trained on clean data, even under arbitrary contamination rates. Our approach integrates a truncated density-ratio-weighted behavior cloning objective into a principled offline RL framework. Empirical evaluation demonstrates that DR-BC achieves near-clean performance under high contamination levels, significantly outperforming standard behavior cloning (BC), BCQ, and BRAC across diverse benchmarks.
Abstract
Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy, with finite-sample bounds that are independent of the contamination rate. We further establish a comprehensive evaluation framework incorporating diverse poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as standard BC, batch-constrained Q-learning (BCQ), and behavior-regularized actor-critic (BRAC).
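The weighting mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a discriminator has already been trained to output P(clean | trajectory), uses the standard identity r = D / (1 - D) for the density ratio of an optimal binary discriminator, and applies a mean-squared-error BC loss; the function names, the clip threshold, and the loss form are illustrative choices.

```python
import numpy as np

def density_ratio_weights(d_probs, clip_max=10.0):
    """Convert discriminator outputs P(clean | trajectory) into clipped
    density-ratio weights r = D / (1 - D), bounding the ratio to limit
    the influence of any single sample."""
    eps = 1e-8  # avoid division by zero when D -> 1
    ratios = d_probs / np.maximum(1.0 - d_probs, eps)
    return np.clip(ratios, 0.0, clip_max)

def weighted_bc_loss(pred_actions, expert_actions, weights):
    """Behavior cloning loss where each sample's squared action error is
    scaled by its density-ratio weight, so likely-corrupted samples
    (low weight) contribute little to the gradient."""
    per_sample = np.sum((pred_actions - expert_actions) ** 2, axis=-1)
    return np.sum(weights * per_sample) / np.sum(weights)

# Samples the discriminator judges likely clean (D = 0.9) get large
# weights; likely-corrupted ones (D = 0.1) are strongly down-weighted.
w = density_ratio_weights(np.array([0.9, 0.5, 0.1]))
```

In this sketch a sample with D = 0.9 receives weight 9, an ambiguous sample (D = 0.5) receives weight 1, and a likely-corrupted sample (D = 0.1) receives weight about 0.11, so corrupted data is suppressed rather than hard-filtered.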