DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

📅 2026-05-29

🤖 AI Summary
This work addresses the challenges in optimizing large language models for multi-turn interactive settings, where online reinforcement learning incurs high computational costs and offline supervised fine-tuning suffers from distributional shift and policy collapse. The authors propose DRIFT, a novel framework that reformulates the KL-regularized reinforcement learning objective as an importance-weighted supervised learning problem, thereby decoupling trajectory generation from policy optimization. Specifically, DRIFT employs a fixed reference policy to sample offline trajectories, computes importance weights based on trajectory returns, and performs supervised fine-tuning on the resulting weighted dataset. This approach retains the training efficiency of standard supervised fine-tuning while achieving performance comparable to or exceeding that of multi-turn reinforcement learning baselines.
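The equivalence mentioned above can be sketched with the standard closed-form solution of a KL-regularized objective. The notation here is an assumption for illustration and may differ from the paper's: $\tau$ is a full multi-turn trajectory, $R(\tau)$ its return, $\pi_{\mathrm{ref}}$ the fixed reference policy, and $\beta > 0$ the regularization strength.

```latex
\begin{align}
J(\pi) &= \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big]
          - \beta \,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big) \\
\pi^{*}(\tau) &= \frac{1}{Z}\,\pi_{\mathrm{ref}}(\tau)\,
          \exp\!\big(R(\tau)/\beta\big),
          \qquad Z = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}
          \big[\exp\!\big(R(\tau)/\beta\big)\big] \\
\arg\min_{\theta}\,\mathrm{KL}\big(\pi^{*} \,\|\, \pi_{\theta}\big)
  &= \arg\max_{\theta}\,\mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}
     \left[\frac{\exp\!\big(R(\tau)/\beta\big)}{Z}\,
     \log \pi_{\theta}(\tau)\right]
\end{align}
```

Under these assumptions, the last line is a log-likelihood objective over trajectories sampled from $\pi_{\mathrm{ref}}$, with each trajectory weighted by $\exp(R(\tau)/\beta)/Z$. Because the expectation is over $\pi_{\mathrm{ref}}$ only, sampling never depends on the current $\pi_{\theta}$, which is why rollouts can be generated once, offline. The exact weighting scheme DRIFT uses may differ from this textbook form.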
📝 Abstract
Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning handles multi-turn dynamics effectively but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
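The three steps in the abstract (sample offline, derive return-based weights, run weighted SFT) can be illustrated with a minimal sketch. It is not taken from the DRIFT repository: the function names `return_weights` and `weighted_sft_loss`, the exponential weight form, the temperature `beta`, and the toy numbers are all assumptions chosen for illustration, and the trajectory log-probabilities stand in for what a real model would compute.

```python
import math


def return_weights(returns, beta=1.0):
    """Hypothetical helper: self-normalized weights w_i proportional to exp(R_i / beta).

    `returns` holds one scalar return per offline trajectory sampled from
    the fixed reference policy. `beta` is an assumed temperature.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in returns]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(traj_logprobs, weights):
    """Hypothetical helper: weighted negative log-likelihood over trajectories.

    `traj_logprobs[i]` is log pi_theta(tau_i), i.e. the summed token
    log-probabilities of trajectory i under the policy being trained.
    """
    if len(traj_logprobs) != len(weights):
        raise ValueError("need one weight per trajectory")
    return -sum(w * lp for w, lp in zip(weights, traj_logprobs))


# Toy data (made up): three offline trajectories with their returns and
# their log-probabilities under the current policy.
returns = [1.0, 0.0, 0.5]
logprobs = [-2.0, -1.0, -3.0]

weights = return_weights(returns, beta=0.5)
loss = weighted_sft_loss(logprobs, weights)
```

In this sketch the weights depend only on the offline returns, so they can be computed once before training, and the optimization loop reduces to ordinary SFT with a per-trajectory scale on the loss. A smaller `beta` concentrates weight on the highest-return trajectories; a larger one approaches uniform SFT on the reference data.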
Problem

Research questions and friction points this paper is trying to address.

multi-turn optimization
reinforcement learning
supervised fine-tuning
distribution shift
behavioral collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Rollouts
Importance-Weighted Fine-Tuning
Multi-Turn Optimization
KL-Regularized RL
Offline Policy Learning