From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing self-improving agents and automatic harness-evolution methods for large language model (LLM) agents struggle to pinpoint, within failed execution trajectories, which harness flaws cause unreliable behavior, so their fixes tend to be broad, indirect, or poorly scoped. This work proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR) that captures step-level provenance and control-flow relations, attributes each failure to the responsible trajectory steps and harness layers, consolidates recurring diagnoses into flaw records, and generates harness patches that are validated under flaw-specific repair specifications. On four benchmarks (SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld), HarnessFix improves held-out test performance over the initial harnesses by 15.2%–50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across the ETCLOVG layers.
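To make the idea of step-level provenance concrete, here is a minimal sketch of a trace representation and a backward walk from a failing step to its earliest failed ancestors. Everything in it is an assumption made for illustration: the `Step` fields, the layer labels (read off the harness responsibilities listed in the abstract), and the `root_causes` rule are hypothetical and are not HarnessFix's actual HTIR schema or attribution algorithm.

```python
from dataclasses import dataclass, field

# Hypothetical layer labels, read off the harness responsibilities listed in
# the abstract; the paper's own schema may differ.
LAYERS = ("execution", "tool", "context", "lifecycle",
          "observability", "verification", "governance")


@dataclass
class Step:
    step_id: int
    layer: str                      # harness layer involved in this step
    action: str                     # normalized description of what happened
    failed: bool = False
    depends_on: list = field(default_factory=list)  # provenance: upstream step ids


def ancestors(by_id, step_id):
    """Return ids of all steps reachable by following provenance edges backwards."""
    seen, stack = set(), list(by_id[step_id].depends_on)
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(by_id[cur].depends_on)
    return seen


def root_causes(steps, failing_id):
    """Failed steps at or upstream of failing_id that have no failed ancestor."""
    by_id = {s.step_id: s for s in steps}
    candidates = ancestors(by_id, failing_id) | {failing_id}
    roots = [
        by_id[i] for i in candidates
        if by_id[i].failed
        and not any(by_id[a].failed for a in ancestors(by_id, i))
    ]
    return sorted(roots, key=lambda s: s.step_id)


# A made-up four-step trajectory: the final verification failure traces back
# to a tool-layer step whose output was truncated.
trace = [
    Step(0, "context", "load task prompt"),
    Step(1, "tool", "test runner output truncated", failed=True, depends_on=[0]),
    Step(2, "lifecycle", "retry with stale state", failed=True, depends_on=[1]),
    Step(3, "verification", "final check rejects patch", failed=True, depends_on=[2]),
]

culprits = root_causes(trace, failing_id=3)
print([(s.step_id, s.layer) for s in culprits])  # [(1, 'tool')]
```

In this toy trace the verification step is where the failure surfaces, but the walk attributes it to the tool layer, which is the kind of scoping that outcome-only feedback cannot provide.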

📝 Abstract

LLM-based agents increasingly rely on harnesses that provide execution environments, tool interfaces, context, lifecycle orchestration, observability, verification, and governance. Existing self-improving agents and automatic harness evolution methods mainly improve agents through runtime supervision, prompt optimization, workflow search, or harness modification based on final outcomes. However, they often fail to diagnose where the responsible evidence lies in failed trajectories and which harness layer causes the unreliable behavior, resulting in broad, indirect, or poorly scoped changes. This paper proposes HarnessFix, a trace-guided framework for diagnosing agent failures and repairing agent harnesses. HarnessFix compiles raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance and control-flow relations. It then attributes failures to responsible trajectory steps and harness layers, consolidates recurring diagnoses into actionable flaw records, and maps them to scoped repair operators. Finally, HarnessFix generates and validates harness patches under flaw-specific repair specifications to reduce target flaws without introducing unacceptable regressions. We evaluate HarnessFix on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld. Across these benchmarks, HarnessFix improves held-out test performance over the initial harnesses by 15.2%–50.0%, outperforms human-designed and self-evolution baselines, and reveals recurring harness-flaw patterns across ETCLOVG layers.
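The last two stages in the abstract, consolidating recurring diagnoses into flaw records and accepting a patch only if it reduces the target flaw without unacceptable regressions, can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the `Diagnosis` record, the `min_support` threshold, and the `max_regressions` acceptance rule are all hypothetical choices, and the abstract does not specify the actual criteria.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    trajectory_id: str
    layer: str   # harness layer blamed for the failure
    flaw: str    # short label for the diagnosed flaw


def consolidate(diagnoses, min_support=2):
    """Keep (layer, flaw) pairs diagnosed in at least min_support distinct trajectories."""
    unique = {(d.layer, d.flaw, d.trajectory_id) for d in diagnoses}
    support = Counter((layer, flaw) for layer, flaw, _ in unique)
    return {key: n for key, n in support.items() if n >= min_support}


def accept_patch(before, after, target_ids, max_regressions=0):
    """before/after map task id -> passed (bool).

    Assumed rule: accept when at least one targeted task newly passes and the
    number of tasks that passed before but fail now stays within the budget.
    """
    fixed = sum(1 for t in target_ids if not before[t] and after[t])
    regressions = sum(1 for t in before if before[t] and not after[t])
    return fixed > 0 and regressions <= max_regressions


# Made-up diagnoses: the tool-layer flaw recurs across two trajectories.
diagnoses = [
    Diagnosis("t1", "tool", "truncated output"),
    Diagnosis("t2", "tool", "truncated output"),
    Diagnosis("t2", "tool", "truncated output"),   # duplicate within one trajectory
    Diagnosis("t3", "context", "missing file listing"),
]
flaw_records = consolidate(diagnoses)

# Made-up validation results for two candidate patches.
before = {"a": False, "b": True, "c": False}
good_patch = {"a": True, "b": True, "c": False}
bad_patch = {"a": True, "b": False, "c": False}

print(flaw_records)
print(accept_patch(before, good_patch, target_ids=["a"]))
print(accept_patch(before, bad_patch, target_ids=["a"]))
```

Counting support over distinct trajectories rather than raw diagnoses keeps one noisy trajectory from promoting a flaw on its own, and the regression budget makes the "unacceptable regressions" condition an explicit, tunable parameter.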

Problem

Research questions and friction points this paper is trying to address.

LLM agents

harness flaws

failure diagnosis

execution traces

reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

HarnessFix

trace-guided diagnosis

HTIR