Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

📅 2026-06-10

🤖 AI Summary

General-purpose agents such as OpenClaw cannot be scored fairly on coding tasks because, on their own, they do not satisfy SWE-bench's evaluation contract, which requires a clean Docker workspace, a patch, and a prediction in the expected format. This work introduces Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses (claws) comparable under a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. Harness design and API cost are treated as first-class axes of SWE-style evaluation. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, and an 80-instance Lite subset is selected by a cost-aware, rank-aware procedure for faster validation. On the full benchmark with the same GLM 5.1 backbone, replacing a minimal direct-diff adapter with the full adapter raises OpenClaw's Pass@1 from 19.1% to 73.4%. Across the reported sweeps, model choice changes Pass@1 by 29.4 percentage points and harness choice by 27.4 percentage points under fixed models, while systems with similar accuracy can differ substantially in total API cost.
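To make the idea of a shared contract concrete, the following is a minimal sketch of how a harness might be wrapped so that every claw receives the same prompt and budget and returns a prediction record. It is not the paper's adapter protocol: `RunContract`, `ClawAdapter`, `run_claw`, and `echo_adapter` are hypothetical names introduced here, and the only detail borrowed from outside the summary is that the three record keys mirror the prediction fields used by the public SWE-bench harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical adapter signature: (prompt, runtime budget in seconds) -> diff text.
ClawAdapter = Callable[[str, int], str]


@dataclass(frozen=True)
class RunContract:
    """Settings held fixed across every claw (illustrative field names)."""
    prompt_template: str  # expected to contain an "{issue}" placeholder
    max_runtime_s: int    # budget handed to the adapter; not enforced in this sketch


def run_claw(adapter: ClawAdapter, instance_id: str, issue_text: str,
             model_name: str, contract: RunContract) -> Dict[str, str]:
    """Build the fixed prompt, call the adapter, and wrap its patch as a prediction."""
    prompt = contract.prompt_template.replace("{issue}", issue_text)
    patch = adapter(prompt, contract.max_runtime_s)
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def echo_adapter(prompt: str, budget_s: int) -> str:
    """Stand-in claw that returns a fixed diff instead of calling an agent."""
    return "diff --git a/app.py b/app.py\n"


contract = RunContract(prompt_template="Resolve this issue:\n{issue}", max_runtime_s=1800)
prediction = run_claw(echo_adapter, "demo__repo-1", "Crash on empty input",
                      "demo-model", contract)
print(prediction["instance_id"])
```

Because the prompt and budget live in one frozen object, swapping `echo_adapter` for a different harness changes only the adapter, which is the kind of controlled comparison the summary describes.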

📝 Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. For faster validation, we also release Claw-SWE-Bench Lite, an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only $19.1\%$ Pass@1, whereas the full adapter reaches $73.4\%$ with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw $\times$ nine-model sweep and a five-claw $\times$ two-model sweep, model choice changes Pass@1 by $29.4$ pp and harness choice by $27.4$ pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
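The headline quantities in the abstract (Pass@1, percentage-point gaps, and total API cost) are simple aggregates over per-instance results. The sketch below shows one way to compute them, assuming each run yields a resolved flag and a dollar cost. The `RunResult` record, the function names, and every number in the toy data are made up for illustration and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunResult:
    """One attempt on one instance (hypothetical record layout)."""
    instance_id: str
    resolved: bool
    api_cost_usd: float


def pass_at_1(results: List[RunResult]) -> float:
    """Percentage of instances resolved on a single attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.resolved for r in results) / len(results)


def total_cost(results: List[RunResult]) -> float:
    """Sum of API spend over all instances, resolved or not."""
    return sum(r.api_cost_usd for r in results)


def pp_gap(a: List[RunResult], b: List[RunResult]) -> float:
    """Difference in Pass@1 between two systems, in percentage points."""
    return pass_at_1(a) - pass_at_1(b)


# Toy data: two adapters on the same four made-up instances.
full_adapter = [RunResult(f"demo-{i}", i != 3, 0.5) for i in range(4)]
minimal_adapter = [RunResult(f"demo-{i}", i == 0, 0.25) for i in range(4)]

print(pass_at_1(full_adapter), pass_at_1(minimal_adapter))
print(pp_gap(full_adapter, minimal_adapter))
print(total_cost(full_adapter), total_cost(minimal_adapter))
```

Reporting cost alongside Pass@1, rather than accuracy alone, is what lets two systems with similar scores be told apart by how much they spend.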

Problem

Research questions and friction points this paper is trying to address.

agent evaluation

coding benchmark

SWE-bench

tool-using agents

harness comparability

Innovation

Methods, ideas, or system contributions that make the work stand out.

agent harness

code generation benchmark

adapter protocol