🤖 AI Summary
This study examines whether high-performing code agents make good teachers for post-training terminal agents and uncovers a “pedagogical paradox”: trajectories from a lower-scoring teacher (DeepSeek-V3.2) yield stronger student generalization than those from a higher-scoring one (Claude Opus 4.6). The authors attribute this to Environment-Grounded Supervision (EGS), in which trajectories expose inspect-act-verify behaviors through harness-visible interactions, and they argue for a shift toward “Harness Engineering,” the systematic design of environment-grounded interaction structures. Training data comes from Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. With only 15.3k Terminal-Lego trajectories, a fine-tuned Qwen3-32B reaches 24.3% on Terminal-Bench 2.0, rivaling prior state-of-the-art results obtained with over 30 times more data.
📝 Abstract
Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption has not been disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: for example, with only 15.3k Terminal-Lego trajectories, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward "Harness Engineering," where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.
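To make the inspect-act-verify idea concrete, here is a minimal Python sketch of how one might check whether a trajectory exposes that structure. It is an illustration under stated assumptions, not the authors' method: the abstract names the three behaviors but gives no schema, so the `Step` record, the phase labels, and the `count_grounded_cycles` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical phase labels: the abstract names these behaviors
# but does not specify how steps are annotated.
INSPECT, ACT, VERIFY = "inspect", "act", "verify"


@dataclass
class Step:
    phase: str      # assumed label: INSPECT, ACT, or VERIFY
    command: str    # shell command issued through the harness
    exit_code: int  # exit status observed by the harness


def count_grounded_cycles(steps: List[Step]) -> int:
    """Count complete inspect -> act -> verify cycles, in order.

    A cycle starts at an INSPECT step, continues at the next ACT step,
    and closes at the next VERIFY step. Out-of-order steps are skipped.
    """
    cycles = 0
    state = 0  # 0: awaiting inspect, 1: awaiting act, 2: awaiting verify
    for step in steps:
        if state == 0 and step.phase == INSPECT:
            state = 1
        elif state == 1 and step.phase == ACT:
            state = 2
        elif state == 2 and step.phase == VERIFY:
            cycles += 1
            state = 0
    return cycles


# Illustrative trajectories (made-up commands, not from the paper).
grounded = [
    Step(INSPECT, "ls src/", 0),
    Step(ACT, "sed -i 's/foo/bar/' src/app.py", 0),
    Step(VERIFY, "pytest -q", 0),
]
blind = [
    Step(ACT, "sed -i 's/foo/bar/' src/app.py", 0),
    Step(ACT, "git commit -am fix", 0),
]
```

Under these assumptions, `count_grounded_cycles(grounded)` counts one complete cycle while `count_grounded_cycles(blind)` counts none. A filter of this kind is one conceivable way to select environment-grounded trajectories, although the abstract does not say whether such filtering is used.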