RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitation of existing agent evaluation benchmarks, which fail to capture the authentic interaction dynamics between developers and AI agents and inadequately reflect the distribution, diversity, and complexity of real-world tasks. To bridge this gap, the authors introduce a dynamic benchmark constructed from a large-scale corpus of real OpenClaw developer sessions. By leveraging conversation reconstruction, execution environment snapshots, and a deterministic validation mechanism, raw user requests are transformed into reproducible, automatically evaluable tasks. Sampling bias is mitigated through Jensen–Shannon divergence–based control to preserve the true task distribution. The resulting benchmark comprises 281 executable tasks; evaluations of 14 leading models reveal that even the best-performing system solves only 65.8% of them, highlighting a significant capability gap in current agents when operating in realistic software development scenarios.
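As a rough illustration of the distribution check mentioned above, the sketch below computes the Jensen-Shannon divergence between a source pool and a sampled subset along each labeled dimension and reports the largest value. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary-of-labels input layout, and the choice of base-2 logarithms (which bounds the divergence in [0, 1]) are all hypothetical.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Empirical probability of each category in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in categories]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed base 2, so 0 <= JSD <= 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def max_js_divergence(source, sample):
    """Largest per-dimension JSD between source and sample.

    Both arguments are hypothetical dicts mapping a dimension name
    (for example "category") to a list of labels, one per session or task.
    """
    worst = 0.0
    for dim, src_labels in source.items():
        cats = sorted(set(src_labels) | set(sample[dim]))
        p = distribution(src_labels, cats)
        q = distribution(sample[dim], cats)
        worst = max(worst, js_divergence(p, q))
    return worst
```

A sampler could, for instance, reject or re-draw a candidate subset whenever `max_js_divergence` exceeds a chosen threshold; whether RealClawBench does it this way is not stated in the text above.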

📝 Abstract

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
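To make the idea of a deterministic verifiable scorer concrete, here is one possible shape for such a check: after the agent finishes, the task workspace is compared against expected file digests, so the same workspace always yields the same verdict. This is only a sketch under assumptions. The abstract does not describe the scorer interface, and `score_task`, `solve_rate`, and the digest-based criterion are hypothetical names and choices made for illustration.

```python
import hashlib
from pathlib import Path


def score_task(workspace: Path, expected: dict[str, str]) -> bool:
    """Pass only if every expected file exists with the expected SHA-256 digest.

    `expected` is a hypothetical mapping from a path relative to the
    workspace to the hex digest the file must have after the agent runs.
    """
    for rel_path, digest in expected.items():
        target = workspace / rel_path
        if not target.is_file():
            return False
        if hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            return False
    return True


def solve_rate(results: list[bool]) -> float:
    """Fraction of tasks whose scorer returned True."""
    return sum(results) / len(results) if results else 0.0
```

Because the check depends only on the final workspace contents, it involves no model-based judging and no randomness. Real tasks would likely need richer checks, such as running tests or inspecting command output, than a digest comparison.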

Problem

Research questions and friction points this paper is trying to address.

agent benchmarks

realism

developer-agent sessions

real-world difficulty

task distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

RealClawBench

developer-agent sessions

reconstructed execution environments