What Makes Interaction Trajectories Effective for Training Terminal Agents?

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines whether high-performing code agents make good teachers for post-training terminal agents and reports a “pedagogical paradox”: trajectories from a lower-scoring teacher (DeepSeek-V3.2) yield stronger student generalization than trajectories from a higher-scoring one (Claude Opus 4.6). The authors attribute this to Environment-Grounded Supervision (EGS), in which trajectories explicitly expose inspect-act-verify behavior through harness-visible interactions, and they describe the systematic design of such interaction structures as “Harness Engineering.” Training tasks come from Terminal-Lego, a pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. With only 15.3k Terminal-Lego trajectories, Qwen3-32B reaches 24.3% on Terminal-Bench 2.0, rivaling prior state-of-the-art results obtained with over 30 times more data.

📝 Abstract

Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward "Harness Engineering", where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.
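To make the inspect-act-verify idea in the abstract concrete, here is a minimal Python sketch of how a trajectory might be checked for that structure. It is an illustration only, not the paper's method: the `Step` record, the phase labels, and the `follows_inspect_act_verify` rule are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical phase labels; the paper's actual annotation scheme is not stated here.
INSPECT, ACT, VERIFY = "inspect", "act", "verify"


@dataclass
class Step:
    phase: str    # one of INSPECT, ACT, VERIFY (assumed labels)
    command: str  # shell command issued by the agent
    output: str   # text the environment returned


def follows_inspect_act_verify(steps: List[Step]) -> bool:
    """Assumed rule: at least one INSPECT precedes the first ACT, and each ACT
    is followed by a VERIFY before the next ACT or the end of the trajectory."""
    seen_inspect = False
    pending_act = False  # True while an ACT has not yet been verified
    for step in steps:
        if step.phase == INSPECT:
            seen_inspect = True
        elif step.phase == ACT:
            if not seen_inspect or pending_act:
                return False
            pending_act = True
        elif step.phase == VERIFY:
            pending_act = False
        else:
            return False  # unknown phase label
    return not pending_act


grounded = [
    Step(INSPECT, "ls src/", "main.py"),
    Step(ACT, "sed -i 's/foo/bar/' src/main.py", ""),
    Step(VERIFY, "python -m pytest -q", "1 passed"),
]
blind = [Step(ACT, "rm -rf build/", "")]

print(follows_inspect_act_verify(grounded))
print(follows_inspect_act_verify(blind))
```

A filter like this could, under these assumptions, separate trajectories that ground each action in environment feedback from those that are just bare action sequences.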

Problem

Research questions and friction points this paper is trying to address.

interaction trajectories

agent training

pedagogical efficacy

environment-grounded supervision

harness engineering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Environment-Grounded Supervision

Terminal-Lego

Harness Engineering