A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

๐Ÿ“… 2025-12-28
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the challenge of jointly optimizing imitation learning (IL) and reinforcement learning (RL) in online fine-tuning of large language models (LLMs). We propose a unified framework that enforces trajectory-level KL divergence constraints to preserve imitation fidelity while leveraging task rewards for long-horizon optimization, enabled by gradient decoupling. Our key contribution is the first derivation of a closed-form token-level IL gradient in logit space, decomposing the composite objective into analytically computable dense gradients (for token-level IL) and sparse Monte Carloโ€“estimated gradients (for reward-driven RL), enabling GPU-native efficient online hybrid updates. Experiments on multi-task instruction tuning show that our method reduces policy variance by 30% compared to pure RLHF, significantly improving training stability and sample efficiency, while maintaining high-fidelity behavioral imitation.

Technology Category

Application Category

๐Ÿ“ Abstract
We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
Problem

Research questions and friction points this paper is trying to address.

Integrates Imitation and Reinforcement Learning for LLM fine-tuning
Decomposes gradient into dense imitation and sparse reward components
Enables efficient GPU implementation via closed-form logit-level formula
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines imitation and reinforcement learning for LLM fine-tuning
Decomposes gradient into dense imitation and sparse reward components
Uses closed-form logit-level formula for efficient GPU implementation
๐Ÿ”Ž Similar Papers
No similar papers found.