Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol

📅 2025-03-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses four challenges in benchmarking AI for Software Engineering (AI4SE): fragmented benchmark knowledge, difficulty selecting relevant benchmarks, the absence of a uniform standard for benchmark development, and limitations of existing benchmarks. The paper reviews 173 studies, identifies 204 AI4SE benchmarks, classifies them, and analyzes their limitations. It introduces BenchScout, a semantic search tool for finding relevant benchmarks (rated 4.5, 4.0, and 4.1 out of 5 for usability, effectiveness, and intuitiveness in a 22-participant user study), and BenchFrame, a unified method for enhancing benchmark quality. Applying BenchFrame to HumanEval yields HumanEvalNext, which corrects errors, improves language conversion, expands test coverage, and increases difficulty. Across ten state-of-the-art code language models, pass@1 scores on HumanEvalNext are 31.22% lower than on HumanEval and 19.94% lower than on HumanEvalPlus, so the new benchmark is harder for current models.

📝 Abstract

Benchmarks are essential for consistent evaluation and reproducibility. The integration of Artificial Intelligence into Software Engineering (AI4SE) has given rise to numerous benchmarks for tasks such as code generation and bug fixing. However, this surge presents challenges: (1) scattered benchmark knowledge across tasks, (2) difficulty in selecting relevant benchmarks, (3) the absence of a uniform standard for benchmark development, and (4) limitations of existing benchmarks. In this paper, we review 173 studies and identify 204 AI4SE benchmarks. We classify these benchmarks, analyze their limitations, and expose gaps in practices. Based on our review, we created BenchScout, a semantic search tool to find relevant benchmarks, using automated clustering of the contexts from associated studies. We conducted a user study with 22 participants to evaluate BenchScout's usability, effectiveness, and intuitiveness, which resulted in average scores of 4.5, 4.0, and 4.1 out of 5, respectively. To advance benchmarking standards, we propose BenchFrame, a unified method to enhance benchmark quality. As a case study, we applied BenchFrame to the HumanEval benchmark and addressed its main limitations. This led to HumanEvalNext, featuring (1) corrected errors, (2) improved language conversion, (3) expanded test coverage, and (4) increased difficulty. We then evaluated ten state-of-the-art code language models on HumanEval, HumanEvalPlus, and HumanEvalNext. On HumanEvalNext, models showed a pass@1 score reduction of 31.22% and 19.94% compared to HumanEval and HumanEvalPlus, respectively.
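
For readers unfamiliar with the pass@1 metric reported above, the following is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass all tests. The function names and the sample counts are illustrative assumptions; the abstract does not describe the paper's own evaluation harness or sampling settings.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples passing all tests, k: attempt budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes (n samples, c correct); not data from the paper.
example_results = [(10, 3), (10, 10), (10, 0)]
print(mean_pass_at_k(example_results, k=1))
```

With k = 1 the estimator reduces to c / n per problem, so pass@1 is simply the average fraction of generated solutions that pass every test. Stricter or more numerous tests, such as those added in HumanEvalNext, can only lower c and therefore the score.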

Problem

Research questions and friction points this paper is trying to address.

Addresses scattered benchmark knowledge in AI4SE tasks.

Develops a tool for finding relevant AI4SE benchmarks.

Proposes a method to enhance benchmark quality standards.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed BenchScout for semantic benchmark search.

Proposed BenchFrame to enhance benchmark standards.

Created HumanEvalNext with improved test coverage.
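
To make the idea of benchmark search concrete, here is a minimal sketch that ranks benchmark descriptions against a query by cosine similarity. It is not BenchScout's implementation: the summary only states that BenchScout uses semantic search with automated clustering of study contexts, so this sketch substitutes simple bag-of-words vectors for real text embeddings. The catalog entries, names, and the `search` function are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Lowercased token counts; a lexical stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, catalog: dict[str, str], top_n: int = 3) -> list[tuple[str, float]]:
    """Return up to top_n (name, score) pairs with non-zero similarity."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(desc)), name) for name, desc in catalog.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then name
    return [(name, round(score, 3)) for score, name in scored[:top_n] if score > 0]


# Hypothetical catalog; names and descriptions are invented for illustration.
CATALOG = {
    "ExampleCodeGen": "python code generation problems with unit tests",
    "ExampleBugFix": "bug fixing tasks for java programs",
    "ExampleTestGen": "unit test generation for python functions",
}

print(search("bug fixing", CATALOG))
```

A real semantic search tool would replace `vectorize` with a learned text embedding so that queries match benchmarks with related meaning rather than only shared words.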