RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository

📅 2026-01-20
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing code generation benchmarks are largely confined to function-level tasks or localized modifications, lacking the capacity to evaluate end-to-end generation of complete microservice repositories from scratch. This work proposes RepoGenesis, the first benchmark for multilingual, end-to-end microservice repository generation, encompassing 106 real-world Python and Java projects spanning 11 frameworks and 18 domains. Data quality is ensured through a rigorous "review-rebuttal" curation process, and novel evaluation metrics—including API coverage and deployment success rate—are introduced. Experimental results show that the best-performing system achieves Pass@1 scores of 23.67% and 21.45% on Python and Java, respectively, with a deployment success rate of up to 100%, though challenges remain in architectural coherence. Notably, GenesisAgent-8B, fine-tuned on this benchmark, matches the performance of GPT-5 mini.

📝 Abstract
Large language models and agents have achieved remarkable progress in code generation. However, existing benchmarks focus on isolated function/class-level generation (e.g., ClassEval) or modifications to existing codebases (e.g., SWE-Bench), neglecting complete microservice repository generation that reflects real-world 0-to-1 development workflows. To bridge this gap, we introduce RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation, comprising 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified through a "review-rebuttal" quality assurance process. We evaluate open-source agents (e.g., DeepCode) and commercial IDEs (e.g., Cursor) using Pass@1, API Coverage (AC), and Deployment Success Rate (DSR). Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java, exposing deficiencies in architectural coherence, dependency management, and cross-file consistency. Notably, GenesisAgent-8B, fine-tuned on RepoGenesis (train), achieves performance comparable to GPT-5 mini, demonstrating the quality of RepoGenesis for advancing microservice generation. We release our benchmark at https://github.com/microsoft/DKI_LLM/tree/main/RepoGenesis.
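The abstract names three aggregate metrics: Pass@1, API Coverage (AC), and Deployment Success Rate (DSR). A minimal sketch of how such per-repository results might be aggregated is shown below; the field names and the simple averaging scheme are assumptions for illustration, not RepoGenesis's actual evaluation harness.

```python
# Hypothetical aggregation of the three metrics described in the abstract.
# Field names ("tests_passed", "endpoints_hit", etc.) are illustrative assumptions.

def pass_at_1(results):
    """Fraction of repositories whose full test suite passes on the first generated attempt."""
    return sum(r["tests_passed"] == r["tests_total"] for r in results) / len(results)

def api_coverage(results):
    """Mean fraction of required API endpoints that the generated repository implements."""
    return sum(r["endpoints_hit"] / r["endpoints_total"] for r in results) / len(results)

def deployment_success_rate(results):
    """Fraction of generated repositories that build and start successfully."""
    return sum(r["deployed"] for r in results) / len(results)

# Toy results for two generated repositories.
repos = [
    {"tests_passed": 20, "tests_total": 20,
     "endpoints_hit": 10, "endpoints_total": 12, "deployed": True},
    {"tests_passed": 5, "tests_total": 20,
     "endpoints_hit": 9, "endpoints_total": 12, "deployed": True},
]
```

Under this toy data the pattern the paper reports is visible in miniature: both repositories deploy (DSR = 1.0) and most endpoints exist (AC ≈ 0.79), yet only half pass all tests (Pass@1 = 0.5), since deploying a service and covering its API is easier than satisfying every behavioral test.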
Problem

Research questions and friction points this paper is trying to address.

microservice generation
repository-level code generation
end-to-end development
LLM benchmarking
software engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

microservice generation
repository-level benchmark
end-to-end code generation
review-rebuttal QA
multilingual LLM evaluation