Efficient Hyperparameter Optimization for LLM Reinforcement Learning

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of hyperparameter optimization (HPO) in large language model reinforcement learning (LLM RL), which stems from the enormous model scale and prohibitive training costs. The authors propose Joint Fidelity Hyperparameter Optimization (JF-HPO), which treats model size and training budget jointly as two fidelity dimensions. JF-HPO runs each trial on a small proxy of the target model, applies early stopping based on training dynamics, and reuses checkpoints to avoid redundant computation. The authors report up to a 14.9× improvement in per-trial computational efficiency, with better or competitive predictive accuracy than existing HPO methods under the same time budget. Relative to the hyperparameter configurations in the VeRL Recipe, they report performance improvements of 5.8% to 111.6%.

📝 Abstract

Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter optimization (HPO) essential yet computationally expensive. Existing multi-fidelity HPO methods remain inefficient for LLM RL due to the massive model scale and resource-intensive training cycles. In this paper, we propose Joint Fidelity Hyperparameter Optimization (JF-HPO), which simultaneously uses both model size and training budget as fidelity dimensions. JF-HPO rests on three components: (i) a small proxy model of the target LLM for efficient training and evaluation in each HPO trial; (ii) carefully designed early-stopping strategies based on training dynamics; and (iii) an efficient checkpointing mechanism that eliminates redundant computation. Compared with existing HPO methods, JF-HPO significantly improves the computational efficiency of each trial (by up to 14.9×), while achieving better or competitive predictive accuracy under the same time budget. Notably, compared with using the hyperparameter configurations from the VeRL Recipe, JF-HPO delivers performance improvements ranging from 5.8% to 111.6%.
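
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how the three ingredients it names could fit together, not the authors' implementation. It uses a generic successive-halving-style loop as a stand-in for the search strategy. Everything here is an assumption made for illustration: the `Trial` class, the synthetic reward curve in `train_proxy` (which stands in for RL training of a small proxy model), the plateau rule in `plateaued`, and the budget schedule. The model-size fidelity is not modeled separately; it is represented only by the idea that `train_proxy` is cheap.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Trial:
    lr: float                 # hypothetical hyperparameter under search
    steps_done: int = 0       # training steps already paid for
    reward: float = 0.0       # latest reward; stands in for a saved checkpoint
    history: list = field(default_factory=list)


def train_proxy(trial, target_steps):
    """Toy stand-in for RL training of a small proxy model.

    Resumes from trial.steps_done, so earlier steps are never recomputed.
    Returns the number of new steps actually run.
    """
    new_steps = target_steps - trial.steps_done
    # Synthetic reward ceiling: highest when lr is near 1e-5 (arbitrary choice).
    quality = math.exp(-abs(math.log10(trial.lr) + 5.0))
    for step in range(trial.steps_done + 1, target_steps + 1):
        trial.reward = quality * (1.0 - math.exp(-step / 20.0))
        trial.history.append(trial.reward)
    trial.steps_done = target_steps
    return new_steps


def plateaued(history, window=5, tol=1e-3):
    """Assumed early-stop rule: reward gain over the last `window` steps is tiny."""
    return len(history) > window and history[-1] - history[-1 - window] < tol


def search(lrs, budgets=(10, 20, 40), keep=0.5):
    """Train all candidates briefly, then keep the best fraction at each budget."""
    trials = [Trial(lr) for lr in lrs]
    steps_spent = 0
    for budget in budgets:
        for t in trials:
            if not plateaued(t.history):      # skip trials that stopped improving
                steps_spent += train_proxy(t, budget)
        trials.sort(key=lambda t: t.reward, reverse=True)
        trials = trials[: max(1, int(len(trials) * keep))]
    return trials[0], steps_spent


best, spent = search([1e-3, 1e-4, 1e-5, 1e-6])
print(best.lr, spent)
```

With four candidates and budgets of 10, 20, and 40 steps, this toy run spends 80 training steps in total, compared with 160 if every candidate were trained for the full 40 steps. The saving comes from pruning weak candidates early and from resuming survivors at their last checkpoint rather than restarting them.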

Problem

Research questions and friction points this paper is trying to address.

Hyperparameter Optimization

Large Language Models

Reinforcement Learning

Multi-fidelity

Computational Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint Fidelity Hyperparameter Optimization

Large Language Models

Reinforcement Learning