🤖 AI Summary
Existing nonlinear scalarization methods leave it unclear whether the mapping from preferences to Pareto-optimal solutions is unique and continuous, which limits principled coverage of dense Pareto fronts. This work studies smooth Tchebycheff (Chebyshev) scalarization in tabular multi-objective MDPs and proves that, under mild interior conditions on the preference set, each preference induces a unique Pareto-optimal return vector that is Lipschitz continuous in the preference. Building on an occupancy-measure formulation, the authors derive the Concave Mirror Descent Policy Iteration (CMDPI) algorithm, which attains an $O(1/k)$ objective-suboptimality rate, and show that each update is equivalent to solving a KL-regularized MDP with the previous policy as reference, which yields finite-iterate policy continuity across preferences. Instantiated as a deep actor-critic algorithm that preserves the previous-policy regularization, the method achieves the best average hypervolume rank among recent baselines on eight MO-Gymnasium tasks, and continuous-control experiments indicate that the gains extend beyond the discrete-action setting.
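To make the scalarization concrete, the following is a minimal sketch of a smooth Tchebycheff utility in the commonly used log-sum-exp form, adapted here to returns that are maximized. The summary does not state the paper's exact formula, so the sign convention, the ideal point `ideal`, the smoothing parameter `mu`, and both function names are assumptions made for illustration.

```python
import math


def smooth_tchebycheff_utility(returns, weights, ideal, mu=0.1):
    """Assumed form: u(J) = -mu * log(sum_i exp(w_i * (z_i - J_i) / mu)).

    returns: return vector J (larger is better)
    weights: preference vector w (assumed strictly positive)
    ideal:   reference point z, assumed to upper-bound each return
    mu:      smoothing parameter; smaller mu is closer to the hard max
    """
    scaled = [w * (z - j) / mu for j, w, z in zip(returns, weights, ideal)]
    shift = max(scaled)  # subtract the max before exponentiating for stability
    lse = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return -mu * lse


def hard_tchebycheff_utility(returns, weights, ideal):
    """Non-smooth counterpart: negative weighted max-distance to the ideal point."""
    return -max(w * (z - j) for j, w, z in zip(returns, weights, ideal))


if __name__ == "__main__":
    J, w, z = [1.0, 2.0], [0.5, 0.5], [3.0, 3.0]
    print(smooth_tchebycheff_utility(J, w, z), hard_tchebycheff_utility(J, w, z))
```

In this form the smooth utility never exceeds the hard one and differs from it by at most `mu * log(n)` for `n` objectives. It is also strictly increasing in every return when the weights are positive, which is the monotonicity property the summary relies on.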
📝 Abstract
Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an $O(1/k)$ objective-suboptimality rate. We further show that each update is equivalent to solving a Kullback-Leibler-regularized MDP with the previous policy as reference, yielding a policy-iteration interpretation and finite-iterate policy continuity across preferences. We instantiate the update as a deep actor-critic algorithm preserving previous-policy regularization. On eight MO-Gymnasium tasks, it achieves the best average hypervolume rank among recent baselines and strong expected-utility performance. Continuous-control experiments indicate gains beyond the discrete-action setting.
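The KL-regularized interpretation can be illustrated with the standard closed-form solution of a single-state KL-regularized greedy step, where the new policy is proportional to the previous policy times an exponentiated action value. This is a sketch of that generic step only, not the authors' CMDPI implementation. The function name, the regularization strength `eta`, and the use of a precomputed scalar `q_values` list (for instance, values under a reward scalarized by the utility gradient) are assumptions.

```python
import math


def kl_regularized_step(prev_policy, q_values, eta=1.0):
    """Solve max_pi sum_a pi(a) * Q(a) - eta * KL(pi || prev_policy) for one state.

    The maximizer is pi(a) proportional to prev_policy(a) * exp(Q(a) / eta).
    prev_policy: action probabilities of the reference (previous) policy
    q_values:    scalar action values, one per action (hypothetical input)
    eta:         KL regularization strength; larger eta keeps pi closer to prev_policy
    """
    logits = [
        (math.log(p) if p > 0 else float("-inf")) + q / eta
        for p, q in zip(prev_policy, q_values)
    ]
    shift = max(logits)  # stabilize the softmax
    unnorm = [math.exp(l - shift) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    print(kl_regularized_step([0.5, 0.5], [1.0, 0.0], eta=1.0))
```

Because the update multiplies the previous policy by a smooth function of the action values, small changes in the values produce small changes in the policy, which gives some intuition for why previous-policy regularization helps keep policies continuous across preferences.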