🤖 AI Summary
Existing code generation benchmarks predominantly adopt a static, single-turn paradigm, failing to capture the iterative, requirement-evolving nature of real-world software development. This limits realistic evaluation of how well large language models (LLMs) can support practical development.
Method: We propose SR-Eval, the first multi-turn code generation benchmark explicitly designed for requirement evolution, covering function-level and repository-level tasks in Python and Java. It introduces a novel multi-agent requirement evolution workflow and a semantic-aware, discriminative test case generation method, integrating static analysis and dynamic testing to construct multi-turn evaluation datasets augmented with interaction traces.
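The evaluation protocol this implies can be sketched in a few lines. The schema and helper names below (`generate`, `run_tests`, the task dict) are illustrative assumptions, not SR-Eval's actual interface:

```python
from typing import Callable

# Hypothetical sketch, not SR-Eval's real API: `generate` stands in for an
# LLM call and `run_tests` for the benchmark's test harness.
def evaluate_task(
    generate: Callable[[list[dict]], str],        # conversation history -> code
    run_tests: Callable[[str, list[str]], bool],  # (code, tests) -> all pass?
    task: dict,  # assumed shape: {"turns": [{"requirement": str, "tests": [...]}]}
) -> tuple[bool, list[bool]]:
    """Evaluate one multi-turn task whose requirement is refined each turn."""
    history: list[dict] = []
    per_turn: list[bool] = []
    for turn in task["turns"]:
        # Each turn appends the refined requirement to the running dialogue,
        # so the model must reconcile it with everything generated so far.
        history.append({"role": "user", "content": turn["requirement"]})
        code = generate(history)
        history.append({"role": "assistant", "content": code})
        # Turn-specific tests judge the code against the requirements
        # accumulated up to this turn.
        per_turn.append(run_tests(code, turn["tests"]))
    # One plausible reading of "task completion": every turn must pass.
    return all(per_turn), per_turn
```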
Contribution/Results: Evaluating 11 state-of-the-art LLMs on SR-Eval reveals severe limitations: the highest task completion rates are only 22.67% (function-level) and 20.00% (repository-level), underscoring critical gaps in current models' ability to support iterative development.
📝 Abstract
Large language models (LLMs) have achieved remarkable progress in code generation. However, existing benchmarks mainly formalize the task as a static, single-turn problem, overlooking the stepwise requirement changes and iterative workflows in real-world software development. This mismatch limits our understanding of how well LLMs can support real-world development workflows. Constructing such iterative benchmarks is challenging due to the lack of public interaction traces and the difficulty of creating discriminative, turn-specific test cases.
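A toy example (ours, not drawn from the benchmark) shows what makes a turn-specific test discriminative: turn 1 asks for the sum of a list, and turn 2 refines the requirement to ignore negative values. A good turn-2 test must reject a correct turn-1 solution:

```python
def total_turn1(xs):  # correct for turn 1: sum all values
    return sum(xs)

def total_turn2(xs):  # correct for turn 2: refined to skip negatives
    return sum(x for x in xs if x >= 0)

def test_turn2_is_discriminative():
    # Passes on the turn-2 solution...
    assert total_turn2([3, -1, 4]) == 7
    # ...but would fail if run against the turn-1 solution (which returns 6),
    # so it actually distinguishes the refined requirement from the old one.
    assert total_turn1([3, -1, 4]) == 6
```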
To bridge this gap, we present SR-Eval, a benchmark specifically designed to assess LLMs on iterative code generation under Stepwise Requirement Refinement. SR-Eval spans both function-level and repository-level tasks in Python and Java, enabling fine-grained, progressive evaluation across evolving requirements. The construction of SR-Eval follows a carefully designed pipeline that first leverages a multi-agent requirement generation method to simulate the development process and recover a multi-round interaction trace from the final requirements, and then employs a semantic-aware test case generation component to ensure discriminative and consistent evaluation at each turn. SR-Eval comprises 443 multi-turn tasks and 1,857 questions at both the function and repository levels. Using SR-Eval, we evaluate 11 representative LLMs with three prompting strategies that simulate different usage patterns. The results show that iterative code generation under stepwise requirement refinement remains highly challenging: the best-performing model achieves a completion rate of only 22.67% on function-level tasks and 20.00% on repository-level tasks. We further observe that prompting strategies substantially influence performance, highlighting the need for more advanced methods tailored to iterative development.
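For a rough picture of the requirement-generation side, the sketch below imagines a two-agent loop that works backwards from a final requirement to a stepwise refinement trace. The agent roles, prompts, and `chat` helper are our own illustrative assumptions, not the paper's implementation:

```python
from typing import Callable

def evolve_requirements(
    chat: Callable[[str, str], str],  # assumed helper: (agent_role, prompt) -> reply
    final_requirement: str,
    n_turns: int = 4,
) -> str | None:
    """Recover a plausible multi-turn refinement trace from a final requirement."""
    # A "planner" agent decomposes the final requirement into incremental
    # steps, each refining the previous one.
    trace = chat(
        "planner",
        f"Decompose into {n_turns} incremental requirements, each refining "
        f"the last; the final step must match:\n{final_requirement}",
    )
    # A "reviewer" agent checks that the trace is coherent and that its last
    # step is equivalent to the original requirement; otherwise signal a retry.
    verdict = chat("reviewer", f"Is this refinement trace consistent?\n{trace}")
    return trace if verdict.strip().lower().startswith("yes") else None
```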