🤖 AI Summary
This work addresses a limitation of existing reinforcement learning reward mechanisms for large language models: because they score only outcome correctness, they give the model no explicit signal to generate diverse reasoning paths or to weigh several candidate solutions the way humans do. To overcome this, the paper proposes Hint-Guided Diversified Policy Optimization (HDPO), a two-stage training framework built around a “propose-select-think” trajectory, in which the model first generates multiple candidate solution hints and then selects the most promising one for in-depth reasoning. HDPO brings human-inspired multi-path problem solving into the reinforcement learning paradigm for large language models, combining a cold start for structured reasoning with a verifiable reward mechanism that explicitly incentivizes exploring solution paths and identifying reliable ones. Experimental results indicate that HDPO improves reasoning performance, the diversity of candidate solutions, and the model’s ability to identify reliable solutions.
📝 Abstract
Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a promising enhancement strategy. However, existing reward mechanisms are limited to outcome-level correctness and lack explicit signals that guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable one, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), which lets the model first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages, Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning, which incentivize the model to generate diverse and reliable solutions following the “propose-select-think” trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances both the diversity of candidate solutions and the LLM's ability to identify reliable solutions.
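To make the “propose-select-think” trajectory more concrete, the following is a minimal Python sketch of how such a response might be parsed and scored. It is illustrative only and is not the paper's implementation: the tag names (`<hint>`, `<select>`, `<answer>`), the helper names, the exact-match outcome check, and the Jaccard-based diversity measure are all assumptions introduced here, since the abstract does not specify the output format or the reward design.

```python
import re
from itertools import combinations


def parse_trajectory(text):
    """Split a response into hints, the selected hint index, and the answer.

    The tag names are hypothetical; the paper's actual format is not given.
    """
    hints = re.findall(r"<hint>(.*?)</hint>", text, flags=re.S)
    select = re.search(r"<select>\s*(\d+)\s*</select>", text)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.S)
    return {
        "hints": [h.strip() for h in hints],
        "selected": int(select.group(1)) if select else None,
        "answer": answer.group(1).strip() if answer else None,
    }


def outcome_reward(parsed, gold):
    """Outcome-level verifiable reward: 1.0 on exact match, else 0.0."""
    return 1.0 if parsed["answer"] is not None and parsed["answer"] == gold else 0.0


def hint_diversity(hints):
    """Mean pairwise Jaccard distance between hint word sets.

    A stand-in diversity measure (an assumption), in the range [0, 1].
    """
    if len(hints) < 2:
        return 0.0
    word_sets = [set(h.lower().split()) for h in hints]
    distances = []
    for a, b in combinations(word_sets, 2):
        union = a | b
        distances.append(1.0 - len(a & b) / len(union) if union else 0.0)
    return sum(distances) / len(distances)
```

In this sketch the outcome reward and the diversity score are kept as separate functions on purpose: how (or whether) HDPO combines such signals into a single reward is not stated in the abstract, so no weighting is assumed here.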