Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

📅 2025-08-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In reinforcement learning with verifiable rewards (RLVR), large reasoning models often converge prematurely to suboptimal solutions due to excessive conservatism. To address this, we propose the first Pass@k-guided RLVR framework that integrates the Pass@k metric directly into training—not merely evaluation—to adaptively balance exploration and exploitation. Our key contributions are threefold: (1) We theoretically and empirically demonstrate that exploration and exploitation can exhibit positive synergy in RLVR; (2) We design an analytically tractable advantage function that explicitly captures the gradient structure of Pass@k rewards, enabling efficient and stable policy optimization; (3) The method significantly improves output diversity and performance on mathematical reasoning and code generation tasks. Experiments confirm substantial mitigation of premature local convergence, validating the effectiveness and promise of explicit advantage function design in RLVR.
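The Pass@k metric that the paper turns into a training reward is conventionally computed with the standard unbiased estimator of Chen et al. (2021): given n sampled solutions of which c are correct, the probability that at least one of k randomly drawn samples is correct. A minimal sketch (this is the standard evaluation estimator, not the paper's training-time advantage derivation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e., one minus the probability that a random size-k subset
    of the n samples contains no correct solution."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset
        # must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples and c=1 correct, pass_at_k(2, 1, 1) gives 0.5, while Pass@1 alone would score each individual sample 0 or 1.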

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, has faced issues in balancing exploration and exploitation, causing policies to prefer conservative actions and converge to a local optimum. Identifying an appropriate reward metric is therefore crucial. In prior work, although Pass@k has been used in evaluation, its connection to LLM exploration ability in RLVR remains largely overlooked. To investigate this, we first use Pass@k as the reward to train the policy model (i.e., $\textbf{Pass@k Training}$), and observe the improvement in its exploration ability. Next, we derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we preliminarily explore advantage design for RLVR, showing promising results and highlighting a potential future direction.
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and exploitation in RLVR with Pass@k
Improving exploration ability using Pass@k as reward
Designing advantage function for RLVR efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Pass@k as reward for training
Derives analytical solution for advantage
Directly designs advantage function
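One way to picture "Pass@k as reward" before any analytical shortcut: sample n rollouts per prompt, score each rollout by the pass@k of random size-k groups that contain it, then normalize within the batch GRPO-style. This is an illustrative Monte-Carlo sketch of the idea, not the paper's closed-form advantage (the paper derives an analytical solution that avoids this sampling):

```python
import random
from statistics import mean, pstdev

def passk_group_advantages(correct, k, n_groups=256, seed=0):
    """Illustrative sketch: each rollout's reward is the fraction of
    random size-k groups containing it that pass (>=1 correct member);
    advantages are the batch-normalized rewards. Requires len(correct) >= k."""
    rng = random.Random(seed)
    n = len(correct)
    rewards = []
    for i in range(n):
        hits = 0
        for _ in range(n_groups):
            others = rng.sample([j for j in range(n) if j != i], k - 1)
            if correct[i] or any(correct[j] for j in others):
                hits += 1
        rewards.append(hits / n_groups)
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + 1e-8) for r in rewards]
```

Note the exploration-friendly effect: an incorrect rollout still receives partial reward whenever its group contains a correct one, so gradients penalize unlucky samples less harshly than a pure Pass@1 reward would.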
🔎 Similar Papers
No similar papers found.
Zhipeng Chen
Renmin University of China, ByteDance Seed
Xiaobo Qin
ByteDance Seed
Youbin Wu
ByteDance Seed
Yue Ling
ByteDance Seed
Qinghao Ye
ByteDance Ltd.; University of California, San Diego
Computer Vision · Multimodal Learning · Video Understanding
Wayne Xin Zhao
Professor, Renmin University of China
Recommender System · Natural Language Processing · Large Language Model
Guang Shi
ByteDance Seed