Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines whether a large language model's task-solving capability predicts its capacity for self-evolution through an external harness. Using controlled experiments, a harness-updating evaluation framework, execution log analysis, and instruction-following diagnostics, it finds that the ability to generate effective harness updates (harness-updating) is largely flat in base model capability: for example, updates produced by Qwen3.5-9B yield gains comparable to those from Claude Opus 4.6. In contrast, the benefit derived from such updates (harness-benefit) is non-monotonic in model capability: mid-tier models gain the most, strong-tier models gain less than mid-tier ones, and weak-tier models benefit little because they either fail to activate the updated harness or fail to follow it faithfully. The findings suggest investing capability budget in the task-solving agent rather than in the evolver.

📝 Abstract

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.
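The two capabilities defined in the abstract can be separated by crossing evolvers (models that write harness updates) with solvers (models that run with the updated harness). The sketch below is only an illustration of that idea, not the paper's evaluation code: the function names, the definition of gain as a simple score difference, the averaging scheme, and all numbers are assumptions made for this example, and the scores are invented placeholders rather than results from the paper.

```python
from statistics import mean

# Hypothetical task success rates with the ORIGINAL harness, keyed by solver tier.
# All values are invented placeholders, not results from the paper.
baseline = {"weak": 0.20, "mid": 0.40, "strong": 0.70}

# Hypothetical success rates after a harness update, keyed by (evolver, solver).
evolved = {
    (evolver, solver): score
    for evolver in ("weak", "mid", "strong")
    for solver, score in {"weak": 0.22, "mid": 0.55, "strong": 0.75}.items()
}


def harness_updating_gain(evolver, solvers):
    """Average gain over solvers when `evolver` wrote the updates.

    Assumption: gain = evolved score minus baseline score.
    """
    return mean(evolved[(evolver, s)] - baseline[s] for s in solvers)


def harness_benefit_gain(solver, evolvers):
    """Average gain for `solver` across updates written by each evolver."""
    return mean(evolved[(e, solver)] - baseline[solver] for e in evolvers)


tiers = ("weak", "mid", "strong")
updating = {t: harness_updating_gain(t, tiers) for t in tiers}
benefit = {t: harness_benefit_gain(t, tiers) for t in tiers}
```

Holding the solver set fixed while varying the evolver isolates harness-updating, and holding the evolver set fixed while varying the solver isolates harness-benefit, which is why the two quantities can diverge across capability tiers.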

Problem

Research questions and friction points this paper is trying to address.

harness self-evolution

harness-updating

harness-benefit

LLM agents

base capability

Innovation

Methods, ideas, or system contributions that make the work stand out.

harness self-evolution

harness-updating

harness-benefit