SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation benchmarks predominantly adopt static, single-turn paradigms and fail to capture the iterative, requirement-evolving nature of real-world software development, which limits realistic evaluation of how well large language models (LLMs) support practical development. Method: We propose SR-Eval, the first multi-turn code generation benchmark explicitly designed for requirement evolution, covering function-level and repository-level tasks in Python and Java. It introduces a novel multi-agent requirement evolution workflow and a semantic-aware, discriminative test case generation method, integrating static analysis and dynamic testing to construct interaction-trace-augmented multi-turn evaluation datasets. Contribution/Results: Evaluated on 11 state-of-the-art LLMs, SR-Eval reveals severe limitations: the highest task completion rates are merely 22.67% (function-level) and 20.00% (repository-level), underscoring critical gaps in current models' ability to support iterative development.

📝 Abstract
Large language models (LLMs) have achieved remarkable progress in code generation. However, existing benchmarks mainly formalize the task as a static, single-turn problem, overlooking the stepwise requirement changes and iterative workflows in real-world software development. This mismatch limits the understanding of how well LLMs can support real-world development workflows. Constructing such iterative benchmarks is challenging due to the lack of public interaction traces and the difficulty of creating discriminative, turn-specific test cases. To bridge this gap, we present SR-Eval, a benchmark specifically designed to assess LLMs on iterative code generation under Stepwise requirements Refinement. SR-Eval spans both function-level and repository-level tasks in Python and Java, enabling fine-grained and progressive evaluation across evolving requirements. The construction of SR-Eval follows a carefully designed pipeline that first leverages a multi-agent-based requirement generation method to simulate the development process and recover the multi-round interaction process from final requirements, then employs a semantic-aware discriminative test case generation component to ensure discriminative and consistent evaluation at each turn. SR-Eval comprises 443 multi-turn tasks and 1,857 questions at both function and repository levels. Using SR-Eval, we evaluate 11 representative LLMs with three prompting strategies that simulate different usage patterns. Results show that iterative code generation under stepwise requirement refinement remains highly challenging: the best-performing model achieves only 22.67% completion rate on function-level tasks and 20.00% on repository-level tasks. We further observe that prompting strategies substantially influence performance, highlighting the need for the development of advanced methods.
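The abstract's evaluation setup can be illustrated with a small sketch. This is not SR-Eval's actual harness; all names here are invented. It shows the core scoring idea the paper describes: a task spans several refinement turns, the model sees the accumulated requirement history at each turn, and the task counts as completed only if the generated code passes the turn-specific tests at every turn.

```python
# Illustrative sketch (not the paper's harness): task completion rate for
# multi-turn code generation under stepwise requirement refinement.
# A task is completed only if every turn's candidate passes that turn's tests.

def evaluate_multi_turn(tasks, generate):
    """tasks: list of turn lists; each turn is (requirement, test_fn).
    generate: fn(requirement_history) -> candidate passed to test_fn."""
    completed = 0
    for turns in tasks:
        history, ok = [], True
        for requirement, test_fn in turns:
            history.append(requirement)
            candidate = generate(history)
            if not test_fn(candidate):
                ok = False
                break  # later turns refine this one, so stop the task here
        completed += ok
    return completed / len(tasks)  # task completion rate

# Toy run with a "model" that simply echoes the latest requirement.
tasks = [
    [("add", lambda c: c == "add"), ("add then sub", lambda c: "sub" in c)],
    [("mul", lambda c: c == "div")],  # fails at its first turn
]
rate = evaluate_multi_turn(tasks, lambda history: history[-1])
print(rate)  # 0.5
```

In a real harness the `test_fn` would execute generated code against turn-specific unit tests rather than compare strings, but the all-turns-must-pass scoring is what makes a 22.67% best-case completion rate so telling.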
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on iterative code generation with evolving requirements
Addressing the gap between static benchmarks and real-world development workflows
Assessing model performance under stepwise requirement refinement scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent requirement generation simulates development process
Semantic-aware test case generation ensures discriminative evaluation
Benchmark spans function and repository levels in Python and Java
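The "discriminative" property listed above can be sketched as a simple check (names invented, a minimal sketch only): a new turn's test suite should accept a solution for the refined requirement while rejecting a solution that only satisfied the previous turn's requirement, so each turn is measurable on its own.

```python
# Hedged sketch of discriminative per-turn tests (not SR-Eval's method):
# the new turn's suite must pass the refined solution and fail at least
# one test against the previous turn's solution.

def is_discriminative(new_tests, prev_solution, new_solution):
    """new_tests: list of fns(code) -> bool."""
    passes_new = all(t(new_solution) for t in new_tests)
    rejects_old = any(not t(prev_solution) for t in new_tests)
    return passes_new and rejects_old

# Turn 1 asked for addition; turn 2 refines it to also handle subtraction.
prev = lambda a, b, op: a + b                        # ignores the operator
curr = lambda a, b, op: a - b if op == "-" else a + b

tests = [
    lambda f: f(2, 3, "+") == 5,   # old behaviour must still hold
    lambda f: f(5, 3, "-") == 2,   # new behaviour separates the two turns
]
print(is_discriminative(tests, prev, curr))  # True
```

Without the `rejects_old` condition, a model could pass a later turn with stale code, which would blur exactly the turn-level signal the benchmark is after.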