Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems

📅 2025-06-20

🤖 AI Summary

Problem: The SWE-Bench public leaderboards do not require detailed documentation for submissions, which obscures the architectures, origins, and technical choices behind LLM- and agent-based program repair approaches. Method: This work presents the first systematic empirical analysis of all 147 submissions to SWE-Bench Lite (68 entries) and SWE-Bench Verified (79 entries), covering 67 unique approaches. It draws on submission metadata, source code repositories, documentation, and architectural descriptions, and profiles each approach along dimensions such as submitter type, product availability, LLM usage, and system architecture. Results: The study identifies three key patterns: (1) dominance of proprietary LLMs, especially Claude 3.5/3.7; (2) a contributor base ranging from individual developers to large tech companies; and (3) coexistence of agentic and non-agentic designs. Together, these findings give an evidence-based map of current practice in automated program repair (APR), covering LLM provenance (open vs. closed), system paradigm, and submitter identity.

📝 Abstract

The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (68 entries) and Verified (79 entries) leaderboards, analyzing 67 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
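The profiling described above amounts to labeling each leaderboard entry along a few dimensions, collapsing entries that share an approach, and tallying the labels. The following is a minimal sketch of that tallying step, not the authors' tooling. The `Submission` record, its field names, the label values, and the sample rows are all hypothetical placeholders and are not data from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One leaderboard entry (hypothetical schema, not from the paper)."""
    leaderboard: str     # "Lite" or "Verified"
    approach: str        # name of the underlying solution
    submitter_type: str  # e.g. "individual", "company", "academia"
    llm_access: str      # "proprietary" or "open"
    architecture: str    # "agentic" or "non-agentic"


# Placeholder rows for illustration only; these are NOT real leaderboard data.
SUBMISSIONS = [
    Submission("Lite", "approach-a", "company", "proprietary", "agentic"),
    Submission("Verified", "approach-a", "company", "proprietary", "agentic"),
    Submission("Lite", "approach-b", "individual", "open", "non-agentic"),
    Submission("Verified", "approach-c", "academia", "proprietary", "agentic"),
]


def unique_approaches(subs):
    """Keep the first entry seen for each approach name."""
    first_seen = {}
    for sub in subs:
        first_seen.setdefault(sub.approach, sub)
    return list(first_seen.values())


def profile(subs, dimension):
    """Count unique approaches by the value of one labeled dimension."""
    return Counter(getattr(sub, dimension) for sub in unique_approaches(subs))


# Entries are counted per leaderboard; approaches are counted once overall.
entries_per_board = Counter(sub.leaderboard for sub in SUBMISSIONS)

if __name__ == "__main__":
    print(dict(entries_per_board))
    for dim in ("submitter_type", "llm_access", "architecture"):
        print(dim, dict(profile(SUBMISSIONS, dim)))
```

Counting entries per leaderboard separately from unique approaches mirrors the distinction the abstract draws between 147 entries and 67 approaches: one approach can appear on both boards, so the two totals differ.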

Problem

Research questions and friction points this paper is trying to address.

Analyze SWE-Bench submissions to understand repair system architectures

Clarify the origins and designs of sparsely documented LLM-based repair solutions

Profile submitter types and LLM usage in automated program repair

Innovation

Methods, ideas, or system contributions that make the work stand out.

First comprehensive analysis of all SWE-Bench Lite and Verified submissions

Evidence that proprietary LLMs, especially Claude 3.5/3.7, dominate

Evidence that agentic and non-agentic designs coexist

🔎 Similar Papers

A Systematic Literature Review on Large Language Models for Automated Program Repair