🤖 AI Summary
Existing code generation benchmarks inadequately evaluate coding agents' ability to implement features under "vibe coding," a paradigm grounded in natural-language instructions rather than formal code-level specifications or isolated problem-solving tasks, and thus fail to reflect real-world feature development practices.
Method: We introduce FeatBench, the first benchmark explicitly designed to assess feature implementation ability, featuring purely natural-language instructions, diverse application scenarios, and rigorous testing protocols.
Contribution/Results: FeatBench provides a high-quality, code-free, automatically evolvable dataset built via multi-stage filtering and automated pipelines; each task is paired with two complementary test types—Fail-to-Pass (F2P) tests that verify the new feature and Pass-to-Pass (P2P) tests that guard against regressions. Empirical evaluation reveals that state-of-the-art agents achieve a success rate of only 29.94% on feature implementation, exposing the double-edged nature of aggressive implementation strategies. All tools and data are publicly released.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as "vibe coding," where users interact with coding agents through high-level natural language. However, existing code generation benchmarks inadequately assess an agent's vibe coding capabilities: they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features:

1. **Pure Natural Language Prompts.** Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints.
2. **A Rigorous & Evolving Data Collection Process.** FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination.
3. **Comprehensive Test Cases.** Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions.
4. **Diverse Application Domains.** The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios.

We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate being only 29.94%. Our analysis also reveals a tendency toward "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
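To make the F2P/P2P test paradigm concrete, here is a minimal sketch of how such a dual check could decide whether an agent's patch resolves a task. This is an illustrative reconstruction, not FeatBench's actual harness; the function name and result format are assumptions.

```python
# Hypothetical sketch: combining Fail-to-Pass (F2P) and Pass-to-Pass (P2P)
# test results into a single resolved/unresolved verdict for a patch.

def evaluate_patch(f2p_results: dict[str, bool],
                   p2p_results: dict[str, bool]) -> bool:
    """A task counts as resolved only if every F2P test now passes
    (the requested feature works) AND every P2P test still passes
    (the patch introduced no regressions)."""
    feature_implemented = all(f2p_results.values())
    no_regressions = all(p2p_results.values())
    return feature_implemented and no_regressions

# Example: the new feature's test still fails, so the task is unresolved
# even though all pre-existing tests pass.
print(evaluate_patch({"test_new_feature": False},
                     {"test_existing_behavior": True}))  # False
```

The conjunction is the key design point: passing only F2P would reward patches that break existing behavior, while passing only P2P would reward patches that change nothing.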