LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models to distraction in long-context reasoning, where locating and integrating the critical information remains difficult. The authors construct training data in two steps: knowledge graph random walks generate multi-hop questions, and search agent trajectories supply tiered distractors, ranging from highly confusable documents that the agent read but did not cite to less confusable documents that appeared in search results but were never opened. They also introduce a fine-grained, entity-level rubric reward that scores each reasoning chain against its gold entities and is applied only to responses whose final answer is correct, so that reinforcement learning receives supervision over intermediate reasoning steps. The reward distinguishes reasoning quality among correct responses while limiting reward hacking. Evaluated on five long-context benchmarks with three reasoning LLMs ranging from 4B to 30B parameters, the method consistently outperforms strong baselines and encourages more comprehensive, evidence-grounded reasoning.
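To make the distractor construction concrete, here is a minimal Python sketch of how documents from one search agent trajectory might be split into tiers. It follows the paper's description (documents read but not cited are high-confusability distractors; documents that only appeared in search results are low-confusability ones). The `Trajectory` record, its field names, and the `build_context` filling order are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: List[str]                            # doc ids that appeared in search results
    opened: Set[str] = field(default_factory=set)   # doc ids the agent actually read
    cited: Set[str] = field(default_factory=set)    # doc ids the agent cited as evidence


def tier_distractors(traj: Trajectory) -> Dict[str, List[str]]:
    """Split retrieved documents into evidence and two distractor tiers."""
    tiers: Dict[str, List[str]] = {"evidence": [], "high": [], "low": []}
    seen: Set[str] = set()
    for doc_id in traj.retrieved:
        if doc_id in seen:          # keep only the first occurrence of each document
            continue
        seen.add(doc_id)
        if doc_id in traj.cited:
            tiers["evidence"].append(doc_id)
        elif doc_id in traj.opened:
            tiers["high"].append(doc_id)    # read but not cited: high confusability
        else:
            tiers["low"].append(doc_id)     # surfaced but never opened: low confusability
    return tiers


def build_context(tiers: Dict[str, List[str]], budget: int) -> List[str]:
    """Assumed policy: keep evidence first, then high-tier, then low-tier distractors,
    truncated to at most `budget` documents."""
    ordered = tiers["evidence"] + tiers["high"] + tiers["low"]
    return ordered[:budget]


traj = Trajectory(
    retrieved=["a", "b", "c", "d", "b"],
    opened={"a", "b", "c"},
    cited={"a"},
)
tiers = tier_distractors(traj)
context = build_context(tiers, budget=3)
```

In this toy run, `a` is evidence, `b` and `c` are high-confusability distractors, and `d` is a low-confusability one. A real pipeline would presumably also shuffle document order and enforce a token budget rather than a document count.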

📝 Abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
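The positive-only rubric reward can be illustrated with a short Python sketch. The abstract specifies only that gold entities along the reasoning chain act as entity-level supervision and that the rubric applies solely to responses with a correct final answer. Everything else here is an assumption for illustration: the function name `rubric_reward`, the base reward of 1.0 for a correct answer, the `weight` parameter, and case-insensitive substring matching as the entity check.

```python
from typing import List


def rubric_reward(
    response: str,
    answer_correct: bool,
    gold_entities: List[str],
    weight: float = 0.5,      # assumed scaling of the rubric bonus
) -> float:
    """Outcome reward plus an entity-coverage bonus, granted only to correct answers."""
    if not answer_correct:
        # Positive-only strategy: wrong answers earn nothing, even if they
        # mention gold entities, so entity name-dropping cannot be rewarded.
        return 0.0
    if not gold_entities:
        return 1.0
    text = response.lower()
    # Assumed matching rule: an entity counts as covered if it appears verbatim.
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return 1.0 + weight * hits / len(gold_entities)


chain = ["Marie Curie", "Warsaw", "Poland"]   # hypothetical gold entities for a 2-hop question
full = rubric_reward("Marie Curie was born in Warsaw, the capital of Poland.", True, chain)
terse = rubric_reward("The answer is Poland.", True, chain)
wrong = rubric_reward("Marie Curie was born in Warsaw, so the answer is Germany.", False, chain)
```

Under these assumptions, two responses with the same correct answer receive different rewards depending on how much of the evidence chain they surface, while an incorrect response receives zero regardless of entity coverage.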

Problem

Research questions and friction points this paper is trying to address.

long-context reasoning

distractor confusability

reinforcement learning

intermediate reasoning supervision

reward sparsity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-context reasoning

Reinforcement learning

Rubric rewards