🤖 AI Summary
Current large language models struggle with long-horizon tasks such as generating complete software repositories from high-level requirements, primarily because large-scale, verifiable whole-repository generation data is scarce. This work proposes a scalable automated pipeline that constructs a high-quality dataset without human annotation by combining a sandboxed agentic workflow, a divide-and-conquer design, a critic-repair mechanism, and difficulty-aware trajectory filtering. Fine-tuning the Qwen3-30B-A3B model on the resulting dataset yields a substantial improvement on the BeyondSWE-Doc2Repo benchmark, raising its score from 5.8% to 47.2%. These results indicate that the approach balances data diversity and quality and can support complex, long-horizon software engineering tasks.
📝 Abstract
As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult due to the scarcity of large-scale, verifiable whole-repository generation data. In this paper, we introduce DeNovoSWE, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, each of which requires generating a complete repository from documentation. The dataset is constructed automatically through a carefully designed sandboxed agentic workflow, enabling scalable curation without human annotation. Its construction follows a divide-and-conquer and critic-repair philosophy. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.