C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-objective reinforcement learning (MORL) faces two key challenges: inefficient Pareto front discovery and poor scalability of high-dimensional preference embeddings. To address these, we propose a two-stage constrained optimization framework: (1) an initialization stage that trains a set of policies in parallel, each optimized toward its own preference over the objectives, and (2) a Pareto extension stage that fills remaining gaps in the front via constrained optimization, maximizing one objective while constraining the others to exceed predefined thresholds. The work bridges constrained policy optimization and MORL, enabling efficient, dense, and scalable coverage of Pareto fronts in objective spaces of up to nine dimensions. Empirically, the method improves hypervolume, expected utility, and sparsity metrics across both discrete and continuous control benchmarks, and it remains robust in the challenging nine-objective setting, consistently outperforming state-of-the-art baselines.

📝 Abstract
Multi-objective reinforcement learning (MORL) excels at handling rapidly changing preferences in tasks that involve multiple criteria, even for unseen preferences. However, previously dominant MORL methods typically generate a fixed policy set or a preference-conditioned policy through multiple training iterations exclusively for sampled preference vectors, and cannot ensure the efficient discovery of the Pareto front. Furthermore, integrating preferences into the input of policy or value functions presents scalability challenges, in particular as the dimensions of the state and preference spaces grow, which can complicate the learning process and hinder the algorithm's performance on more complex tasks. To address these issues, we propose a two-stage Pareto front discovery algorithm called Constrained MORL (C-MORL), which serves as a seamless bridge between constrained policy optimization and MORL. Concretely, a set of policies is trained in parallel in the initialization stage, with each optimized towards its individual preference over the multiple objectives. Then, to fill the remaining vacancies in the Pareto front, constrained optimization steps are employed to maximize one objective while constraining the other objectives to exceed a predefined threshold. Empirically, compared to recent advancements in MORL methods, our algorithm achieves more consistent and superior performance in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks, especially with numerous objectives (up to nine objectives in our experiments).
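The constrained optimization step in the second stage admits a compact formulation. The following is a sketch in our own notation (the paper may use different symbols): let $J_i(\pi)$ denote the expected return of policy $\pi$ on objective $i$ and $b_i$ the predefined threshold for objective $i$; then each Pareto extension step solves

\max_{\pi} \; J_k(\pi) \quad \text{subject to} \quad J_i(\pi) \ge b_i \quad \text{for all } i \neq k.

Sweeping the maximized index $k$ and the thresholds $b_i$ over values attained by the initialization-stage policies yields new non-dominated policies that fill gaps in the front.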
Problem

Research questions and friction points this paper is trying to address.

Efficient discovery of the Pareto front in MORL
Scalability of preference embedding as the state and preference spaces grow in dimension
Consistent performance on complex multi-objective control tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage Pareto front discovery algorithm (C-MORL)
Parallel policy training, with each policy optimized toward its own preference
Constrained optimization that maximizes one objective while keeping the others above thresholds (see the sketch after this list)
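A minimal, runnable Python sketch of the two-stage procedure described above; the toy linear "environment", the random-search updates, and names such as train_for_preference and constrained_extend are our assumptions, standing in for real RL training and a constrained policy optimization solver.

import numpy as np

rng = np.random.default_rng(0)
NUM_OBJECTIVES = 3
PARAM_DIM = 8

# Toy stand-in: a "policy" is a parameter vector, and its per-objective
# returns come from a fixed random linear map (placeholder for RL rollouts).
OBJ_MATRIX = rng.normal(size=(NUM_OBJECTIVES, PARAM_DIM))

def returns(theta):
    # Vector of per-objective returns J(pi) in the toy setting.
    return OBJ_MATRIX @ np.tanh(theta)

def train_for_preference(pref, steps=200, lr=0.05):
    # Stage 1 (sketch): improve a policy on its preference-weighted return.
    theta = rng.normal(size=PARAM_DIM)
    for _ in range(steps):
        cand = theta + lr * rng.normal(size=PARAM_DIM)
        if pref @ returns(cand) > pref @ returns(theta):
            theta = cand
    return theta

def constrained_extend(theta, k, thresholds, steps=200, lr=0.05):
    # Stage 2 (sketch): maximize objective k while keeping every other
    # objective above its threshold; random search stands in for a
    # constrained policy optimization update.
    for _ in range(steps):
        cand = theta + lr * rng.normal(size=PARAM_DIM)
        j = returns(cand)
        ok = all(j[i] >= thresholds[i] for i in range(NUM_OBJECTIVES) if i != k)
        if ok and j[k] > returns(theta)[k]:
            theta = cand
    return theta

# Stage 1: one policy per sampled preference (the paper trains these in parallel).
prefs = rng.dirichlet(np.ones(NUM_OBJECTIVES), size=6)
policies = [train_for_preference(p) for p in prefs]

# Stage 2: push each policy along each objective to fill gaps in the front.
front = []
for theta in policies:
    base = returns(theta)
    for k in range(NUM_OBJECTIVES):
        thresholds = base - 0.1 * np.abs(base)  # hypothetical slack rule, not the paper's
        front.append(returns(constrained_extend(theta.copy(), k, thresholds)))

print(f"Collected {len(front)} candidate Pareto points.")

The feasibility check mirrors the abstract's description: one objective is maximized while all others must stay above their thresholds. A real implementation would replace the random search with a constrained policy gradient step.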
Authors

Ruohong Liu
Hong Kong University of Science and Technology (Guangzhou)

Yuxin Pan
Hong Kong University of Science and Technology

Linjie Xu
Queen Mary University of London
Reinforcement Learning

Lei Song
Microsoft Research Asia

Pengcheng You
Peking University

Yize Chen
Assistant Professor, University of Alberta
Machine Learning, Power Systems, Optimization, Control

Jiang Bian
Microsoft Research Asia