Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

πŸ“… 2025-10-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Offline reinforcement learning (RL) suffers performance degradation in safety-critical applications when training data is corrupted, e.g., by adversarial poisoning or system failures. To address this, we propose Density-Ratio Weighted Behavior Cloning (DR-BC), a robust offline RL method that leverages a small clean reference dataset to estimate trajectory-level density ratios via a binary discriminator, thereby automatically identifying and down-weighting anomalous samples without requiring prior knowledge of the corruption mechanism. We theoretically establish that DR-BC converges to the optimal policy trained on clean data, even under arbitrary contamination rates. Our approach integrates a truncated density-ratio-weighted behavior cloning objective into a principled offline RL framework. Empirical evaluation demonstrates that DR-BC achieves near-clean performance under high contamination levels, significantly outperforming standard behavior cloning (BC), BCQ, and BRAC across diverse benchmarks.

πŸ“ Abstract
Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy, with finite-sample bounds that are independent of the contamination rate. We also establish a comprehensive evaluation framework incorporating various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior-regularized actor-critic (BRAC).
Problem

Research questions and friction points this paper is trying to address.

Robust policy learning from corrupted offline datasets
Mitigating adversarial poisoning and system errors in imitation learning
Achieving near-optimal performance under high contamination rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses density ratios to weight imitation learning
Leverages clean reference set via binary discriminator
Clips ratios to prioritize expert data
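The weighting idea in the bullets above can be sketched end-to-end with a linear policy. This is a hypothetical illustration, assuming the weights have already been produced by a clipped density-ratio estimator: the BC objective becomes a weighted regression, minimizing sum_i w_i * (pi(s_i) - a_i)^2, which for a linear policy has a closed-form weighted least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: expert actions follow a = 2*s; poisoned samples flip the sign.
states = rng.uniform(-1.0, 1.0, size=(300, 1))
actions = 2.0 * states
actions[250:] = -2.0 * states[250:]          # last 50 samples are corrupted

# Per-sample weights as a density-ratio estimator might produce them
# (illustrative values: ~1 for clean samples, near 0 for corrupted ones).
weights = np.ones(300)
weights[250:] = 0.01

# Weighted BC for a linear policy pi(s) = theta * s: solve the weighted
# least-squares problem by scaling rows with sqrt(w_i).
sw = np.sqrt(weights)[:, None]
theta = np.linalg.lstsq(sw * states, (sw * actions).ravel(), rcond=None)[0]
# theta recovers the clean expert slope (close to 2.0) despite the poisoning.
```

With uniform weights the same fit would be dragged toward the poisoned samples; the down-weighting is what keeps the estimate near the clean expert.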
πŸ”Ž Similar Papers
No similar papers found.
S
Shriram Karpoora Sundara Pandian
Department of Cybersecurity, Rochester Institute of Technology, Rochester, NY 14623
Ali Baheri
Assistant Professor, Rochester Institute of Technology
Safe Learning · Geometry · Reinforcement Learning · Optimal Transport