AI Summary
High-risk-high-return (HRHR) tasks exhibit multimodal action distributions and highly stochastic returns, yet mainstream reinforcement learning (RL) methods rely on unimodal Gaussian policies and scalar critics, leading to poor convergence and inadequate risk modeling. Method: We formally define HRHR tasks and prove that Gaussian policies cannot guarantee convergence to optimal solutions. To address this, we propose a novel distributional RL framework: (i) explicitly approximating multimodal policies via discretization of the continuous action space; (ii) introducing a dual-distribution critic that separately models the expectation and risk-sensitive distribution of action values; and (iii) incorporating entropy regularization to enhance exploration. Contribution/Results: Experiments on high-failure-risk locomotion and robotic manipulation tasks demonstrate significant improvements over state-of-the-art baselines. Our results validate that explicit multimodal policy approximation and distributional risk-aware value estimation are essential for robust decision-making in HRHR settings.
Abstract
Tasks involving high-risk-high-return (HRHR) actions, such as obstacle crossing, often exhibit multimodal action distributions and stochastic returns. Most reinforcement learning (RL) methods assume unimodal Gaussian policies and rely on scalar-valued critics, which limits their effectiveness in HRHR settings. We formally define HRHR tasks and theoretically show that Gaussian policies cannot guarantee convergence to the optimal solution. To address this, we propose a reinforcement learning framework that (i) discretizes continuous action spaces to approximate multimodal distributions, (ii) employs entropy-regularized exploration to improve coverage of risky but rewarding actions, and (iii) introduces a dual-critic architecture for more accurate discrete value distribution estimation. The framework scales to high-dimensional action spaces, supporting complex control domains. Experiments on locomotion and manipulation benchmarks with high risks of failure demonstrate that our method outperforms baselines, underscoring the importance of explicitly modeling multimodality and risk in RL.
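The combination of action-space discretization, a multimodal (categorical) policy, and an entropy bonus can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the bin count, action dimensionality, and function names are assumptions, and the bimodal logits in dimension 0 stand in for a learned HRHR policy that places mass on two distant action modes.

```python
import numpy as np

def make_bins(low, high, n_bins):
    """Midpoints of n_bins equal-width bins for one action dimension."""
    edges = np.linspace(low, high, n_bins + 1)
    return (edges[:-1] + edges[1:]) / 2.0

def softmax(logits):
    """Row-wise softmax over the bin axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_action(logits, bins, rng):
    """Sample one bin per action dimension; return the continuous action."""
    probs = softmax(logits)                      # (action_dim, n_bins)
    idx = [rng.choice(len(bins), p=p) for p in probs]
    return bins[np.array(idx)]

def mean_entropy(logits):
    """Mean per-dimension entropy of the categorical policy (exploration bonus)."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
bins = make_bins(-1.0, 1.0, n_bins=11)           # 11 bins covering [-1, 1]
logits = np.zeros((3, 11))                       # 3-D action, uniform to start
logits[0, [0, 10]] = 5.0                         # bimodal: mass at both extremes
action = sample_action(logits, bins, rng)        # a concrete continuous action
```

A Gaussian policy in dimension 0 would have to put its mean between the two extreme bins, averaging over the modes; the categorical parameterization keeps both peaks, and `mean_entropy` supplies the regularization term that keeps low-probability bins explorable during training.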