Automatic Reward Shaping from Multi-Objective Human Heuristics

πŸ“… 2025-12-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In multi-objective reinforcement learning (MORL), manually designed reward functions suffer from subjectivity and poor generalizability. To address this, we propose MORSE, a framework that automatically synthesizes multiple human-specified heuristic rewards into a unified, differentiable, and optimizable composite reward function via a bilevel optimization mechanism. Crucially, MORSE injects exploratory noise derived from both task performance and the prediction error of a fixed, randomly initialized neural network, enhancing policy exploration and mitigating convergence to local optima. Integrated with policy-gradient optimization, MORSE is evaluated across diverse robot control tasks in MuJoCo and Isaac Sim. Results show that it matches or surpasses hand-tuned reward functions in both Pareto optimality and overall task performance while significantly reducing reward engineering effort.
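As a rough illustration of the bi-level mechanism sketched in the summary, the toy Python loop below alternates an inner step (a stand-in for policy training on the current shaped reward) with an outer step that nudges the reward weights toward better task performance and adds performance-scaled noise. The softmax weighting, the finite-difference outer update, and all names here are illustrative assumptions, not MORSE's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heuristics = 3
weight_logits = np.zeros(n_heuristics)  # outer-loop variables

def shaped_reward(heuristic_rewards, logits):
    """Combine heuristic rewards with softmax weights (one assumed scheme)."""
    w = np.exp(logits) / np.exp(logits).sum()
    return float(w @ heuristic_rewards)

def inner_loop(logits):
    """Stand-in for policy training on the shaped reward; returns a noisy
    surrogate of task performance instead of running PPO/SAC."""
    heuristics = rng.normal(loc=[1.0, 0.5, -0.2], scale=0.1)
    return shaped_reward(heuristics, logits)

for outer_step in range(50):  # outer loop over reward functions
    perf = inner_loop(weight_logits)
    # Finite-difference estimate of d(performance)/d(logits).
    eps, grad = 1e-2, np.zeros(n_heuristics)
    for i in range(n_heuristics):
        bumped = weight_logits.copy()
        bumped[i] += eps
        grad[i] = (inner_loop(bumped) - perf) / eps
    # Exploration noise that shrinks as task performance improves.
    noise = rng.normal(size=n_heuristics) * 0.1 / (1.0 + abs(perf))
    weight_logits += 0.05 * grad + noise

print("learned reward weights:", np.exp(weight_logits) / np.exp(weight_logits).sum())
```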

πŸ“ Abstract
Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to that obtained with manually tuned reward functions.
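The exploration signal described at the end of the abstract, the prediction error of a fixed, randomly initialized network, closely resembles Random Network Distillation (RND). Below is a minimal, self-contained PyTorch sketch of that idea; the architecture, layer sizes, and function names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 8, 32

# Fixed, randomly initialized target network: never trained.
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)

# Predictor network, trained to imitate the target on visited states.
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def novelty_bonus(obs: torch.Tensor) -> torch.Tensor:
    """Per-state prediction error: large on rarely visited states,
    decaying as the predictor catches up on familiar ones."""
    with torch.no_grad():
        target_feat = target(obs)
    error = (predictor(obs) - target_feat).pow(2).mean(dim=-1)
    opt.zero_grad()
    error.mean().backward()  # train the predictor so the bonus decays over time
    opt.step()
    return error.detach()

obs_batch = torch.randn(16, obs_dim)
print(novelty_bonus(obs_batch))  # higher value = more novel state
```

Because the predictor is only ever trained on visited states, its error stays high on novel ones, which is what makes this signal usable as an exploration-driving noise source.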
Problem

Research questions and friction points this paper is trying to address.

Manually designed reward functions are subjective and generalize poorly across tasks
Balancing multiple heuristic objectives by hand requires laborious reward engineering
Reward shaping is prone to convergence on suboptimal local minima
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically combines multiple human-designed heuristic rewards
Formulates shaping as bi-level optimization with policy training
Introduces stochasticity for exploration to avoid local minima
πŸ”Ž Similar Papers
No similar papers found.
Yuqing Xie
Tsinghua University
Jiayu Chen
Tsinghua University
Wenhao Tang
Tsinghua University
Ya Zhang
Shanghai Jiao Tong University
Machine learning · Computer vision · Medical Imaging
Chao Yu
Tsinghua University
Yu Wang
Tsinghua University