AI Summary
Offline reinforcement learning (RL) suffers from performance degradation in safety-critical applications when training data is corrupted, e.g., by adversarial poisoning or system failures. To address this, we propose Density-Ratio Weighted Behavior Cloning (DR-BC), a robust offline RL method that leverages a small clean reference dataset to estimate trajectory-level density ratios via a binary discriminator, thereby automatically identifying and down-weighting anomalous samples without requiring prior knowledge of the corruption mechanism. We theoretically establish that DR-BC converges to the optimal policy trained on clean data, even under arbitrary contamination rates. Our approach integrates a truncated density-ratio-weighted behavior cloning objective into a principled offline RL framework. Empirical evaluation demonstrates that DR-BC achieves near-clean performance under high contamination levels, significantly outperforming standard behavior cloning (BC), BCQ, and BRAC across diverse benchmarks.
Abstract
Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy, with finite-sample bounds that are independent of the contamination rate. We further establish a comprehensive evaluation framework incorporating diverse poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as standard BC, batch-constrained Q-learning (BCQ), and behavior-regularized actor-critic (BRAC).
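The weighting mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a discriminator has already been trained to output P(clean | trajectory), uses the standard identity r = D / (1 - D) for the density ratio of an optimal binary discriminator, and applies a mean-squared-error BC loss; the function names, the clip threshold, and the loss form are illustrative choices.

```python
import numpy as np

def density_ratio_weights(d_probs, clip_max=10.0):
    """Convert discriminator outputs P(clean | trajectory) into clipped
    density-ratio weights r = D / (1 - D), bounding the ratio to limit
    the influence of any single sample."""
    eps = 1e-8  # avoid division by zero when D -> 1
    ratios = d_probs / np.maximum(1.0 - d_probs, eps)
    return np.clip(ratios, 0.0, clip_max)

def weighted_bc_loss(pred_actions, expert_actions, weights):
    """Behavior cloning loss where each sample's squared action error is
    scaled by its density-ratio weight, so likely-corrupted samples
    (low weight) contribute little to the gradient."""
    per_sample = np.sum((pred_actions - expert_actions) ** 2, axis=-1)
    return np.sum(weights * per_sample) / np.sum(weights)

# Samples the discriminator judges likely clean (D = 0.9) get large
# weights; likely-corrupted ones (D = 0.1) are strongly down-weighted.
w = density_ratio_weights(np.array([0.9, 0.5, 0.1]))
```

In this sketch a sample with D = 0.9 receives weight 9, an ambiguous sample (D = 0.5) receives weight 1, and a likely-corrupted sample (D = 0.1) receives weight about 0.11, so corrupted data is suppressed rather than hard-filtered.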