Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

📅 2026-06-10

🤖 AI Summary

This work addresses sparse trajectory-level rewards and difficult credit assignment in long-horizon tool-use reinforcement learning, where direct token-level self-distillation can amplify harmful shortcuts along with useful skills. The authors propose Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than for policy imitation. Dynamic sampling yields groups that mix successful and failed “sibling” rollouts, and an external LLM summarizes their contrast into a training-only, step-level credit reference. Dense teacher/student divergence then drives credit reassignment, and bounded, detached credit weights reshape the GRPO token advantages. The external model is used only during training; the deployed student sees no external LLM, sibling evidence, or oracle. On AppWorld and τ³-airline, SGCD improves over matched GRPO comparators: AppWorld TGC rises from 42.9 to 45.6 on test_normal and from 24.7 to 27.0 on test_challenge, and τ³-airline pass@1 rises from 0.583 to 0.602.
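To make the "sibling" idea concrete, here is a minimal sketch of one way to keep only rollout groups that contain both a success and a failure for the same task, and to enumerate success/failure pairs within them. The function names, the `(rollout_id, success)` tuple layout, and the filter-only strategy are assumptions for illustration; the summary does not say how the paper's dynamic sampling resamples or pairs rollouts.

```python
from itertools import product


def mixed_sibling_pairs(rollouts):
    """Pair every successful rollout with every failed one.

    rollouts: list of (rollout_id, success) tuples for ONE task (assumed layout).
    Returns a list of (success_id, failure_id) pairs; empty if the group is
    all-success or all-failure, since such a group offers no contrast.
    """
    wins = [rid for rid, ok in rollouts if ok]
    losses = [rid for rid, ok in rollouts if not ok]
    return list(product(wins, losses))


def keep_mixed_groups(groups):
    """Drop groups without contrast (hypothetical stand-in for dynamic sampling).

    groups: dict mapping task_id -> list of (rollout_id, success).
    Returns a dict mapping task_id -> list of sibling pairs, mixed groups only.
    """
    kept = {}
    for task_id, rollouts in groups.items():
        pairs = mixed_sibling_pairs(rollouts)
        if pairs:
            kept[task_id] = pairs
    return kept


if __name__ == "__main__":
    demo = {
        "task_a": [("r1", True), ("r2", False), ("r3", False)],
        "task_b": [("r4", True), ("r5", True)],  # no failure, so it is dropped
    }
    print(keep_mixed_groups(demo))
```

In this sketch an all-success or all-failure group is simply discarded; a real pipeline might instead draw more rollouts until the group becomes mixed.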

📝 Abstract

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $τ^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $τ^3$-airline pass@1 $0.583 \to 0.602$.
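The contrast between a broadcast trajectory-level advantage and a reshaped token-level one can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: the mean-one normalization of per-token divergence, the clipping bounds `low` and `high`, and all function names are hypothetical. The group-relative advantage follows the usual GRPO form (reward minus group mean, divided by group standard deviation).

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: one scalar per rollout in the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def bounded_credit_weights(divergences, low=0.5, high=2.0):
    """Turn per-token teacher/student divergence into bounded weights.

    Assumption: weights are divergence divided by its mean, clipped to
    [low, high]. They are plain floats here; in an autograd framework they
    would be detached so that no gradient flows through them.
    """
    mean = statistics.fmean(divergences)
    if mean <= 0:
        return [1.0] * len(divergences)  # no signal: fall back to broadcast
    return [min(high, max(low, d / mean)) for d in divergences]


def reshape_token_advantages(traj_advantage, divergences, low=0.5, high=2.0):
    """Scale the broadcast trajectory advantage token by token.

    Weights are strictly positive, so the sign of the advantage, which is
    set by the outcome verifier, is never flipped.
    """
    weights = bounded_credit_weights(divergences, low, high)
    return [traj_advantage * w for w in weights]


if __name__ == "__main__":
    advs = grpo_advantages([1.0, 0.0])           # one success, one failure
    token_div = [0.0, 1.0, 5.0]                  # hypothetical per-token divergence
    print(reshape_token_advantages(advs[1], token_div))
```

Keeping the weights positive and bounded means the verifier's outcome still decides the direction of every update and the credit signal only changes its size per token, which is one reading of using distillation for credit assignment rather than as a competing actor loss.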

Problem

Research questions and friction points this paper is trying to address.

long-horizon tool-use

reinforcement learning

self-distillation

credit assignment

outcome verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

credit assignment

self-distillation

tool-use reinforcement learning