Reinforcement Learning from Rich Feedback with Distributional DAgger

📅 2026-06-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
Standard reinforcement learning from verifiable rewards (RLVR) scores each sampled response with a single bit for final-answer correctness, so it cannot use the richer feedback available during reasoning. This work proposes DistIL, a distributional variant of DAgger in which the learner queries an expert distribution on states visited by its own policy and trains with a forward cross-entropy objective. The objective works with a black-box expert, can draw on signals such as execution traces, tool outputs, expert corrections, and model self-evaluations, and has a sequence-level gradient that assigns credit by propagating future expert-student disagreement back to earlier decisions. The authors show that forward cross-entropy admits monotonic policy improvement and regret guarantees, whereas self-distillation objectives based on reverse KL or Jensen-Shannon do not guarantee monotonic improvement. The objective also optimizes a lower bound on teacher-weighted likelihood of success, which the authors link to improved Pass@N. Empirically, DistIL improves over RLVR and RL with self-distillation baselines on scientific reasoning, coding, and hard mathematical problems.
📝 Abstract
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
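To make the contrast between forward cross-entropy and reverse KL concrete, here is a minimal NumPy sketch that evaluates both on a toy set of states. It is not the paper's implementation. It assumes the expert supplies a full next-token distribution at each state visited by the student, and it computes only per-state losses averaged over those states; the sequence-level gradient described in the abstract is not reproduced. The function names, the toy vocabulary of four tokens, and the example probabilities are all hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def forward_cross_entropy(student_logits, expert_probs):
    """Mean over states of -sum_a p_expert(a|s) * log pi_student(a|s).

    The expectation over actions is taken under the expert, so the student
    is pushed to cover every action the expert assigns mass to.
    """
    logp = log_softmax(student_logits)
    return float(-(expert_probs * logp).sum(axis=-1).mean())


def reverse_kl(student_logits, expert_probs, eps=1e-12):
    """Mean over states of KL(pi_student || p_expert).

    The expectation over actions is taken under the student, which is the
    direction the abstract attributes to prior self-distillation objectives.
    """
    logp = log_softmax(student_logits)
    p = np.exp(logp)
    return float((p * (logp - np.log(expert_probs + eps))).sum(axis=-1).mean())


# Hypothetical rollout: 3 states visited by the student, vocabulary of 4 tokens.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(3, 4))
expert_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
])

fce = forward_cross_entropy(student_logits, expert_probs)
rkl = reverse_kl(student_logits, expert_probs)
print(f"forward cross-entropy: {fce:.4f}")
print(f"reverse KL:            {rkl:.4f}")
```

In this sketch the states come from the student's own rollout, which mirrors the DAgger-style setup of querying the expert on states visited by the current policy. When the student matches the expert exactly, the forward cross-entropy equals the expert's entropy and the reverse KL is zero.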
Problem

Research questions and friction points this paper is trying to address.

rich feedback
reinforcement learning
imitation learning
distributional DAgger
credit assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional DAgger
Forward Cross-Entropy
Rich Feedback
Monotonic Policy Improvement
Reinforcement Learning from Human Feedback