Hint-Guided Diversified Policy Optimization for LLM Reasoning

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing reinforcement learning reward mechanisms: because they score only outcome correctness, they give large language models no explicit signal to generate diverse reasoning paths or to emulate the human practice of weighing several candidate solutions and selecting one. To overcome this, the paper proposes Hint-Guided Diversified Policy Optimization (HDPO), a framework built around a “propose-select-think” trajectory in which the model first generates multiple candidate solution hints and then selects the most promising one for in-depth reasoning. Training proceeds in two stages: a cold start for structured reasoning, followed by hint-guided diversified reinforcement learning with verifiable rewards that explicitly incentivizes exploring and identifying reliable solution paths. HDPO is presented as the first approach to integrate human-inspired multi-path problem solving into the reinforcement learning paradigm for large language models. Experimental results show that HDPO improves reasoning performance, the diversity of candidate solutions, and the model’s ability to identify reliable solutions.

📝 Abstract

Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to outcome-level correctness and lack explicit signals that guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable one, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), which has the model first list candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the “propose-select-think” trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances both the diversity of candidate solutions and the LLM's ability to identify reliable solutions.
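
To make the “propose-select-think” trajectory more concrete, the following is a minimal sketch of how such an output might be parsed and scored. It is not the paper's implementation: the tag layout (`<hints>`, `<select>`, `<think>`, `<answer>`), the Jaccard-based diversity measure, the exact-match correctness check, and the `diversity_weight` parameter are all assumptions introduced here for illustration, since the abstract does not specify the output format or the reward formula.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical tag layout; the abstract does not define the output format.
TRAJECTORY_PATTERN = re.compile(
    r"<hints>(?P<hints>.*?)</hints>\s*"
    r"<select>(?P<select>\d+)</select>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)


@dataclass
class Trajectory:
    hints: List[str]   # candidate solution outlines ("propose")
    selected: int      # 1-based index of the chosen hint ("select")
    reasoning: str     # detailed reasoning ("think")
    answer: str        # final answer checked by the verifier


def parse_trajectory(text: str) -> Optional[Trajectory]:
    """Return the parsed trajectory, or None if the structure is malformed."""
    match = TRAJECTORY_PATTERN.search(text)
    if match is None:
        return None
    hints = [ln.strip() for ln in match.group("hints").splitlines() if ln.strip()]
    selected = int(match.group("select"))
    if not 1 <= selected <= len(hints):
        return None
    return Trajectory(hints, selected,
                      match.group("think").strip(),
                      match.group("answer").strip())


def hint_diversity(hints: List[str]) -> float:
    """Toy diversity score: 1 minus the mean pairwise Jaccard word overlap."""
    if len(hints) < 2:
        return 0.0
    word_sets = [set(h.lower().split()) for h in hints]
    sims = []
    for i in range(len(word_sets)):
        for j in range(i + 1, len(word_sets)):
            union = word_sets[i] | word_sets[j]
            sims.append(len(word_sets[i] & word_sets[j]) / len(union) if union else 1.0)
    return 1.0 - sum(sims) / len(sims)


def toy_reward(text: str, reference: str, diversity_weight: float = 0.1) -> float:
    """Assumed reward: exact-match correctness plus a small diversity bonus."""
    traj = parse_trajectory(text)
    if traj is None:
        return 0.0  # malformed trajectories earn nothing in this sketch
    correct = 1.0 if traj.answer == reference else 0.0
    return correct + diversity_weight * hint_diversity(traj.hints)


sample = (
    "<hints>\n1. Use substitution\n2. Use elimination\n</hints>\n"
    "<select>2</select>\n"
    "<think>Add the two equations to cancel y.</think>\n"
    "<answer>x = 3</answer>"
)
reward = toy_reward(sample, reference="x = 3")
```

In this sketch a malformed response receives zero reward, which is one simple way to tie a verifiable outcome check to the structured format; the actual reward design used by HDPO may differ.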

Problem

Research questions and friction points this paper is trying to address.

LLM reasoning

reward mechanism

solution diversity

reinforcement learning

verifiable rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hint-Guided

Diversified Policy Optimization

LLM Reasoning