Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing training paradigms for deep research: they either use short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning hard to disentangle from execution and yields weak credit assignment for planning. The authors propose DecomposeR, a planner-centric framework that represents research plans as typed directed acyclic graphs (DAGs) and trains a Qwen3-8B model with two-stage reinforcement learning (RL). In the first stage, planner RL learns graph structure and query decomposition; in the second, answerer RL learns branch-level execution and final synthesis conditioned on the learned plan. Rewards are assigned to explicit planner tokens and structured graph components rather than to a flat trajectory, which enables finer-grained optimization of planning. On popular long-form benchmarks, DecomposeR-8B outperforms strong comparable open baselines by 5.1–8.0 points, which the authors attribute to improved planning and answering capabilities.
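To make the idea of a research plan as a typed DAG concrete, here is a minimal Python sketch. It is not the authors' implementation: the `PlanNode` fields, the type labels such as "search" and "synthesize", and the `execution_order` helper are all hypothetical names chosen for illustration. The sketch only shows how a plan could be checked for acyclicity and ordered so that each branch runs after the branches it depends on.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class PlanNode:
    """One node of a research plan (hypothetical schema, not from the paper)."""
    node_id: str
    node_type: str  # assumed type labels, e.g. "search" or "synthesize"
    query: str
    depends_on: list[str] = field(default_factory=list)


def execution_order(nodes: list[PlanNode]) -> list[str]:
    """Return node ids in dependency order, or raise ValueError if the
    plan is not a well-formed DAG."""
    ids = {node.node_id for node in nodes}
    if len(ids) != len(nodes):
        raise ValueError("duplicate node ids in plan")
    for node in nodes:
        unknown = [dep for dep in node.depends_on if dep not in ids]
        if unknown:
            raise ValueError(f"{node.node_id} depends on unknown nodes: {unknown}")
    # TopologicalSorter expects a mapping from each node to its predecessors.
    sorter = TopologicalSorter({node.node_id: node.depends_on for node in nodes})
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        raise ValueError("plan contains a cycle") from exc
```

An executor following such a plan would visit nodes in the returned order, so a synthesis node is reached only after the retrieval branches it depends on have completed.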

📝 Abstract

Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer RL then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1–8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.
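The contrast between rewarding structured components and rewarding a flat trajectory can be illustrated with a small Python sketch. The abstract does not give the reward formulation, so everything here is an assumption made for illustration: the `ComponentSpan` record, the idea that each plan component occupies a contiguous token span, and the example reward values are hypothetical and do not reproduce the paper's reward design.

```python
from dataclasses import dataclass


@dataclass
class ComponentSpan:
    """A contiguous run of planner tokens belonging to one plan component
    (hypothetical bookkeeping, not taken from the paper)."""
    name: str
    start: int  # inclusive token index
    end: int    # exclusive token index
    reward: float


def flat_rewards(num_tokens: int, trajectory_reward: float) -> list[float]:
    """Baseline: every token receives the same trajectory-level reward."""
    return [trajectory_reward] * num_tokens


def component_rewards(
    num_tokens: int, spans: list[ComponentSpan], default: float = 0.0
) -> list[float]:
    """Assign each token the reward of the component span that covers it.
    Tokens outside every span keep the default value."""
    rewards = [default] * num_tokens
    for span in spans:
        if not 0 <= span.start <= span.end <= num_tokens:
            raise ValueError(f"span {span.name!r} is out of range")
        for i in range(span.start, span.end):
            rewards[i] = span.reward
    return rewards
```

Under the flat scheme, a well-formed graph structure and a poorly decomposed query would receive identical credit; with span-level assignment, each group of tokens can be scored on its own, which is the kind of finer-grained signal the abstract describes.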

Problem

Research questions and friction points this paper is trying to address.

deep research

planning

credit assignment

long-form QA

structured reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

planner-centric reinforcement learning

structured research planning

typed DAG