🤖 AI Summary
This work addresses sparse trajectory-level rewards and difficult credit assignment in long-horizon tool-use reinforcement learning, where direct token-level self-distillation can amplify harmful shortcuts alongside useful skills. The authors propose Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces groups of mixed successful and failed “sibling” rollouts, and an external LLM summarizes their contrast into a training-only, step-level credit reference. Dense teacher–student divergence then drives credit reassignment, and bounded, detached credit weights reshape the GRPO token advantages. The deployed student needs no external LLM, sibling evidence, or oracle. On AppWorld and τ³-airline, SGCD improves over matched GRPO comparators: AppWorld TGC rises from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 rises from 0.583 to 0.602.
📝 Abstract
Long-horizon tool-use reinforcement learning can learn from outcome verification, but its
trajectory-level advantage is broadcast across many reasoning, API, and answer tokens.
Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged
teacher. We show, however, that direct token-level self-distillation can silently destroy tool use:
it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills
and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation
(SGCD), which uses distillation for credit assignment rather than as a competing actor loss.
Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes
their contrast into a training-only stepwise credit reference; dense teacher/student divergence
drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The
deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and
$\tau^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on
test_normal and $24.7 \to 27.0$ on test_challenge, and $\tau^3$-airline pass@1 $0.583 \to 0.602$.
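
As an illustration only, the last two steps of the pipeline (keeping mixed sibling groups, then reshaping GRPO token advantages with bounded, detached credit weights) might be sketched as follows. The abstract gives no formulas, so everything here is an assumption rather than the authors' method: the function names, the absolute teacher/student log-probability gap as the divergence measure, the normalization to mean 1, the rule that a larger gap earns more credit, and the clip range [0.5, 1.5] are all hypothetical choices.

```python
from statistics import mean, pstdev


def is_mixed_group(rewards, success_threshold=1.0):
    """Return True when a sibling group contains both successes and failures.

    Assumption: a rollout counts as successful when reward >= success_threshold.
    """
    successes = [r >= success_threshold for r in rewards]
    return any(successes) and not all(successes)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style trajectory advantage: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bounded_credit_weights(student_logprobs, teacher_logprobs, low=0.5, high=1.5):
    """Turn per-token teacher/student gaps into bounded credit weights.

    Hypothetical rule: weight = |log p_teacher - log p_student| divided by the
    mean gap, clipped to [low, high]. The weights are plain floats, so they act
    as constants; in an autodiff framework they would be wrapped in a
    stop-gradient (detach) so that no gradient flows through them.
    """
    gaps = [abs(t - s) for s, t in zip(student_logprobs, teacher_logprobs)]
    avg = mean(gaps)
    if avg == 0.0:
        # Teacher and student agree everywhere: fall back to uniform credit.
        return [1.0] * len(gaps)
    return [min(high, max(low, g / avg)) for g in gaps]


def reshape_token_advantages(trajectory_advantage, weights):
    """Scale the broadcast trajectory advantage token by token."""
    return [trajectory_advantage * w for w in weights]


# Toy group of four sibling rollouts with binary verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
if is_mixed_group(rewards):
    advantages = group_relative_advantages(rewards)
    # Made-up per-token log-probabilities for the first rollout.
    student_lp = [-1.0, -1.0, -1.0, -1.0]
    teacher_lp = [-1.0, -1.2, -3.0, -1.0]
    weights = bounded_credit_weights(student_lp, teacher_lp)
    token_advantages = reshape_token_advantages(advantages[0], weights)
    print(weights)
    print(token_advantages)
```

Because the weights are clipped to a positive range, each token advantage keeps the sign of the trajectory-level advantage, so in this sketch the verifier outcome still decides the direction of the update and the weights only redistribute its magnitude across tokens.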