AI Summary
Existing trajectory synthesis methods predominantly focus on write-intensive, multi-turn tasks, overlooking the read-intensive challenges posed by high evidential loads in single-decision scenarios. This work proposes WRIT, a novel framework that decouples trajectory complexity into two orthogonal axes: the number of write decisions and the evidential burden per decision. WRIT systematically generates write-read-intensive training trajectories through task generation, diverse user behavior modeling, and executable environment simulation, establishing an efficient synthetic pipeline. Remarkably, a 4B-parameter model trained on only 2K WRIT-generated trajectories outperforms GPT-5.1 no-think on τ²-bench while substantially reducing inference token consumption, thereby demonstrating the efficacy of evidence-driven decision modeling.
Abstract
Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequence of user messages, agent responses, and tool calls. Synthesizing sufficiently complex trajectories has become a central route to training agents: existing pipelines often increase difficulty by composing multiple user requests into longer tasks, producing write-intensive trajectories that train sequential execution.
We argue that a single write decision can itself be difficult when the agent must gather and compare substantial read-tool evidence before its arguments become identifiable, a challenge that write-intensive data alone cannot address. Guided by this insight, we propose WRIT (Write-Read Intensive Trajectory Synthesis), a pipeline for synthesizing multi-turn agent training trajectories along two complexity axes: the number of write decisions in a task and the evidence burden of each individual decision. WRIT first generates write-intensive and read-heavy tasks. It then diversifies user behavior instructions to reflect realistic conversational variation, and finally simulates agent-user interactions in an executable environment to produce complete training trajectories. The resulting data trains agents not only for longer task execution, but also for robust, evidence-grounded decision making under high information load. With only 2K synthesized trajectories, a 4B model trained on WRIT outperforms GPT-5.1 no-think on $\tau^2$-bench and substantially reduces inference-time token usage, showing that compact SFT data can convert part of expensive test-time reasoning into efficient agent behavior.
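The two complexity axes and the three pipeline stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`WriteDecision`, `Task`, `generate_task`, `diversify`, `simulate`, the `USER_STYLES` list) is a hypothetical stand-in, and the "environment" is a trivial loop rather than an executable tool simulator. It only shows how a task with many write decisions differs from a task with one write decision that needs many reads.

```python
import random
from dataclasses import dataclass


@dataclass
class WriteDecision:
    tool: str            # hypothetical write tool the agent must eventually call
    evidence_reads: int  # read-tool calls needed before its arguments are identifiable


@dataclass
class Task:
    decisions: list

    @property
    def num_writes(self) -> int:
        # Axis 1: number of write decisions in the task.
        return len(self.decisions)

    @property
    def max_evidence_burden(self) -> int:
        # Axis 2: evidence burden of the heaviest single decision.
        return max((d.evidence_reads for d in self.decisions), default=0)


def generate_task(num_writes: int, reads_per_write: int) -> Task:
    """Stage 1 (toy): build a task at a chosen point on the two axes."""
    return Task([WriteDecision(f"write_tool_{i}", reads_per_write)
                 for i in range(num_writes)])


# Assumed, illustrative user behaviors; the paper's actual set is not given here.
USER_STYLES = ["terse", "verbose", "changes_mind"]


def diversify(task: Task, rng: random.Random) -> dict:
    """Stage 2 (toy): attach a sampled user-behavior instruction."""
    return {"task": task, "user_style": rng.choice(USER_STYLES)}


def simulate(instance: dict) -> list:
    """Stage 3 (toy): emit an interleaved trajectory of messages and tool calls."""
    trajectory = [("user", f"[{instance['user_style']}] request")]
    for decision in instance["task"].decisions:
        for k in range(decision.evidence_reads):
            trajectory.append(("read_call", f"read_{k}"))
        trajectory.append(("write_call", decision.tool))
    trajectory.append(("agent", "done"))
    return trajectory


rng = random.Random(0)
write_intensive = simulate(diversify(generate_task(4, 1), rng))
read_heavy = simulate(diversify(generate_task(1, 6), rng))
```

In this sketch the write-intensive trajectory contains four write calls with one read each, while the read-heavy trajectory contains a single write call preceded by six reads, which is the kind of single-decision evidence load the abstract argues is missing from write-intensive data.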