🤖 AI Summary
Large language models (LLMs) suffer from insufficient exploration and low sample efficiency in reasoning tasks due to vast state-action spaces and sparse rewards.
Method: We propose DIVER, a reinforcement learning framework with three key components: (i) verifiable rewards grounded in task correctness; (ii) intrinsic rewards derived from a global sequence-level diversity metric computed in a semantically structured space; (iii) a potential-based reward shaping mechanism that preserves optimal policy invariance, paired with simple heuristics to suppress reward hacking. Crucially, DIVER is the first to empirically establish a strong positive correlation between global sequence-level diversity and reasoning capability.
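The two reward ingredients above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mean pairwise Jaccard distance over token sets stands in for DIVER's semantic-space diversity metric, and `shaped_reward` shows the generic potential-based shaping form (adding γ·Φ(s′) − Φ(s) to the reward, which is known to leave the optimal policy unchanged); the function names and the choice of potential are illustrative assumptions.

```python
import itertools

def global_diversity(responses):
    """Mean pairwise Jaccard distance over token sets.
    A crude stand-in for the semantic sequence-level diversity
    metric described in the paper (assumption for illustration)."""
    token_sets = [set(r.split()) for r in responses]
    pairs = list(itertools.combinations(token_sets, 2))
    if not pairs:
        return 0.0
    dists = [1.0 - len(a & b) / len(a | b) for a, b in pairs]
    return sum(dists) / len(dists)

def shaped_reward(verifiable_r, phi_s, phi_s_next, gamma=1.0):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s).
    This additive form preserves the optimal policy regardless of
    the potential function Phi chosen (here Phi could encode the
    diversity-based intrinsic signal)."""
    return verifiable_r + gamma * phi_s_next - phi_s
```

For example, two identical responses yield diversity 0, while fully disjoint responses yield 1, so the intrinsic term rewards a batch of rollouts for covering distinct reasoning trajectories rather than collapsing onto one.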
Results: Experiments demonstrate that DIVER significantly outperforms mainstream RL baselines on both in-domain and out-of-domain reasoning benchmarks, achieving consistent improvements in Pass@1/Pass@k accuracy, sample efficiency, and cross-distribution generalization.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In this paper, we propose **DIVER** (**D**iversity-**I**ncentivized Exploration for **V**ersatil**E** **R**easoning), an innovative framework that highlights the pivotal role of global sequence-level diversity in incentivizing deep exploration for versatile reasoning. We first conduct a preliminary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve optimal policy invariance and design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at https://github.com/NJU-RL/DIVER.