Breaking the Bias Barrier in Concave Multi-Objective Reinforcement Learning

📅 2026-03-09
🤖 AI Summary
This work addresses the issue of policy gradient estimation bias induced by nonlinear concave scalarization in multi-objective reinforcement learning, which degrades sample complexity. The authors propose a novel algorithm that integrates natural policy gradient (NPG) with a multilevel Monte Carlo (MLMC) estimator to effectively control gradient bias and substantially reduce sampling costs. Theoretical analysis demonstrates that, under general concave scalarization functions, the method achieves the optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity—improving upon the prior best-known $\widetilde{O}(\varepsilon^{-4})$ bound. Moreover, under second-order smoothness conditions, the first-order gradient bias cancels automatically, enabling the algorithm to attain the optimal rate even without MLMC.
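The bias the summary describes is easy to reproduce numerically: for a nonlinear concave $f$, the plug-in estimator $\partial f(\hat J)$ is biased even when $\hat J$ itself is unbiased. A minimal sketch (toy numbers and noise model of our own choosing, not from the paper), using the concave scalarization $f(j) = -e^{-j}$ so that $f'(j) = e^{-j}$:

```python
import math
import random

random.seed(0)

true_J = 2.0                       # true expected return (toy value, an assumption)
n_trials, n_rollouts = 20000, 4    # small batches per estimate make the bias visible

f_prime = lambda j: math.exp(-j)   # derivative of the concave scalarization f(j) = -exp(-j)

plug_in = 0.0
for _ in range(n_trials):
    # J_hat: empirical return averaged over a small batch of noisy rollouts
    J_hat = sum(true_J + random.gauss(0.0, 1.0) for _ in range(n_rollouts)) / n_rollouts
    plug_in += f_prime(J_hat)
plug_in /= n_trials

# By Jensen's inequality (exp(-x) is convex), E[f'(J_hat)] > f'(E[J_hat]):
print(f"E[f'(J_hat)] ~ {plug_in:.4f}  vs  f'(J) = {f_prime(true_J):.4f}")
```

The gap shrinks only as the per-estimate batch grows, which is why a naive plug-in policy gradient pays extra samples at every iteration — the mechanism behind the $\widetilde{O}(\varepsilon^{-4})$ barrier the paper identifies.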

📝 Abstract
While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility $f(J_1^\pi,\dots,J_M^\pi)$ over multiple objectives, where each $J_m^\pi$ denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on $\partial f(J^\pi)$, while in practice only empirical return estimates $\hat J$ are available. Because $f$ is nonlinear, the plug-in estimator is biased ($\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])$), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an intrinsic $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity due to this bias. To address this issue, we develop a Natural Policy Gradient (NPG) algorithm equipped with a multi-level Monte Carlo (MLMC) estimator that controls the bias of the scalarization gradient while maintaining low sampling cost. We prove that this approach achieves the optimal $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for computing an $\epsilon$-optimal policy. Furthermore, we show that when the scalarization function is second-order smooth, the first-order bias cancels automatically, allowing vanilla NPG to achieve the same $\widetilde{\mathcal{O}}(\epsilon^{-2})$ rate without MLMC. Our results provide the first optimal sample complexity guarantees for concave multi-objective reinforcement learning under policy-gradient methods.
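A rough sketch of the multilevel Monte Carlo idea the abstract invokes (a generic MLMC debiasing scheme in toy form; the function names, noise model, and level weights are illustrative assumptions, not the paper's estimator): level $l$ estimates $\partial f(\hat J)$ from $2^l$ rollouts, and inverse-probability-weighted corrections between consecutive levels telescope, so the combined estimate carries the small bias of the finest level at roughly the expected cost of the coarsest.

```python
import math
import random

random.seed(1)

TRUE_J = 2.0  # toy true expected return; each rollout returns TRUE_J plus unit Gaussian noise

def sample_return():
    """One noisy empirical return (stand-in for a policy rollout)."""
    return TRUE_J + random.gauss(0.0, 1.0)

def plug_in_grad(samples):
    """Plug-in estimate of f'(J) from a batch, with f(j) = -exp(-j)."""
    return math.exp(-sum(samples) / len(samples))

def mlmc_grad(max_level=8):
    """One MLMC sample of f'(J): coarse base estimate plus a random-level correction."""
    base = plug_in_grad([sample_return()])
    # pick correction level l with probability proportional to 2^{-(l+1)}
    weights = [2.0 ** -(l + 1) for l in range(max_level)]
    total = sum(weights)
    l = random.choices(range(max_level), weights=weights)[0]
    # coupled fine/coarse pair: the coarse estimate reuses half of the fine batch
    fine_batch = [sample_return() for _ in range(2 ** (l + 1))]
    correction = plug_in_grad(fine_batch) - plug_in_grad(fine_batch[: 2 ** l])
    # inverse-probability weighting keeps the telescoping sum's expectation
    # equal to that of the finest level
    return base + correction * total / weights[l]

n = 20000
mlmc_avg = sum(mlmc_grad() for _ in range(n)) / n
naive_avg = sum(plug_in_grad([sample_return()]) for _ in range(n)) / n
print(f"true f'(J) = {math.exp(-TRUE_J):.4f}, MLMC ~ {mlmc_avg:.4f}, "
      f"naive 1-sample ~ {naive_avg:.4f}")
```

The geometric level distribution keeps the expected number of rollouts per MLMC draw at $O(\texttt{max\_level})$ while the residual bias decays with the finest level — the trade-off that lets the paper's NPG+MLMC method reach $\widetilde{\mathcal{O}}(\epsilon^{-2})$. The abstract's second result says that when $f$ is second-order smooth this machinery is unnecessary, since the first-order bias cancels on its own.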
Problem

Research questions and friction points this paper is trying to address.

multi-objective reinforcement learning
concave scalarization
gradient bias
sample complexity
policy gradient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Objective Reinforcement Learning
Concave Scalarization
Gradient Bias
Natural Policy Gradient
Multi-Level Monte Carlo