Agentic Rubrics as Contextual Verifiers for SWE Agents

πŸ“… 2026-01-07
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
This work addresses the lack of efficient, scalable, and context-aware validation mechanisms in current software engineering agents (SWE Agents). Traditional execution-based verification struggles to scale, while alternative approaches often neglect repository-level context. To overcome these limitations, we propose Agentic Rubricsβ€”a novel framework in which expert agents interact with code repositories to automatically generate context-sensitive rubrics that evaluate the quality of candidate patches without requiring test execution. Our approach introduces, for the first time, agent-generated, fine-grained, and interpretable scoring criteria into the SWE validation pipeline, achieving both scalability and strong generalization. Evaluated on SWE-Bench Verified, our method attains scores of 54.2% and 40.6% using Qwen3-Coder-30B-A3B and Qwen3-32B, respectively, outperforming the strongest baseline by at least 3.5 percentage points.

πŸ“ Abstract
Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
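The selection loop the abstract describes — generate a context-grounded rubric checklist once, then score many candidate patches against it and keep the best, with no test execution — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the names (`RubricItem`, `score_patch`, `select_best_patch`) are assumptions, and a trivial substring check stands in for the LLM judge an actual system would use to grade each criterion.

```python
"""Hedged sketch of rubric-based patch selection under parallel
Test-Time Scaling (TTS). All names and the scoring heuristic are
illustrative assumptions, not the paper's actual method."""

from dataclasses import dataclass, field


@dataclass
class RubricItem:
    # A single context-grounded, codebase-specific criterion,
    # e.g. produced by an expert agent that explored the repository.
    criterion: str
    weight: float = 1.0


@dataclass
class Rubric:
    items: list[RubricItem] = field(default_factory=list)


def score_patch(patch: str, rubric: Rubric) -> float:
    """Score a candidate patch against the rubric without running tests.

    A real system would ask an LLM judge whether each criterion is
    satisfied; here a substring check is a stand-in for that judgment.
    Returns the weighted fraction of satisfied criteria in [0, 1].
    """
    if not rubric.items:
        return 0.0
    satisfied = sum(
        item.weight for item in rubric.items if item.criterion in patch
    )
    total = sum(item.weight for item in rubric.items)
    return satisfied / total


def select_best_patch(patches: list[str], rubric: Rubric) -> str:
    """Parallel TTS: sample several candidate patches in parallel,
    then keep the one with the highest rubric score."""
    return max(patches, key=lambda p: score_patch(p, rubric))


# Toy usage: the rubric asks that a patch guard against None input.
rubric = Rubric([
    RubricItem("if value is None"),
    RubricItem("return default"),
])
candidates = [
    "value.strip()",
    "if value is None:\n    return default\nvalue.strip()",
]
best = select_best_patch(candidates, rubric)
```

Because the rubric is built once per issue and scoring needs no environment setup, this selection step scales with the number of sampled patches rather than with test-suite runtime, which is the efficiency argument the abstract makes.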
Problem

Research questions and friction points this paper is trying to address.

verification
software engineering agents
code execution
scalability
contextual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic Rubrics
Test-Time Scaling
SWE Agents
context-grounded verification
code patch evaluation