🤖 AI Summary
Reinforcement learning (RL) for autonomous vehicle control suffers from reward engineering bias, while imitation learning relies on scarce, high-quality expert action data. Method: We propose a reward-free end-to-end RL framework (RFRLF) that eliminates both explicit reward signals and expert action labels. It introduces a target state prediction network (TSPN) and a reward-free state-guided policy network (RFSGPN), optimizing the policy by minimizing the state prediction error relative to target states (e.g., desired trajectory points) rather than using reward or action supervision. Contribution/Results: This work establishes a purely target-state-driven control paradigm that requires only environmental observations and target-state supervision, obviating reward shaping and expert demonstrations entirely. Evaluated on standard vehicle control benchmarks, our method improves sample efficiency and policy robustness, enabling effective and stable autonomous driving learning even in reward-absent settings.
📝 Abstract
Reinforcement learning plays a crucial role in vehicle control by guiding agents to learn optimal control strategies through designed or learned reward signals. However, in vehicle control applications, rewards typically must be designed manually while accounting for multiple implicit factors, which easily introduces human biases. Although imitation learning methods do not rely on explicit reward signals, they require high-quality expert actions, which are often difficult to acquire. To address these issues, we propose a reward-free reinforcement learning framework (RFRLF). This framework directly learns target states to optimize agent behavior through a target state prediction network (TSPN) and a reward-free state-guided policy network (RFSGPN), avoiding dependence on manually designed reward signals. Specifically, the policy network is learned by minimizing the difference between the predicted state and the expert state. Experimental results demonstrate the effectiveness of the proposed RFRLF in controlling vehicle driving, showing its advantages in improving learning efficiency and adapting to reward-free environments.
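The core idea above can be sketched in a few lines: instead of a reward, the training signal is the squared error between a predicted next state and a supervised target state. The snippet below is a minimal illustration of that objective only; the linear dynamics (standing in for the learned TSPN), the linear policy (standing in for the RFSGPN), and all numbers are assumptions for this sketch, not the paper's actual networks.

```python
import numpy as np

# Assumed known one-step dynamics s' = A s + B a, standing in for the
# learned target state prediction network (TSPN).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])

def predict_next(s, a):
    """Predict the next state from the current state and action."""
    return A @ s + B @ a

# Linear policy a = W s, standing in for the reward-free state-guided
# policy network (RFSGPN). No reward is ever computed: the training
# signal is the squared error between the predicted next state and a
# supervised target state (e.g., a desired trajectory point).
W = np.zeros((1, 2))
s = np.array([1.0, 0.0])          # current observation
s_target = np.array([1.0, -0.5])  # target (expert) state

lr = 0.5
for _ in range(200):
    a = W @ s
    err = predict_next(s, a) - s_target       # state prediction error
    grad_W = B.T @ err[:, None] @ s[None, :]  # gradient of 0.5 * ||err||^2
    W -= lr * grad_W

final_loss = 0.5 * np.sum((predict_next(s, W @ s) - s_target) ** 2)
```

In this toy linear case the policy converges so that the predicted next state matches the target exactly; in the full framework the TSPN itself is a learned network and the policy is optimized through it in the same state-matching fashion, with gradients supplied by backpropagation rather than the hand-derived expression used here.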