FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code generation benchmarks inadequately evaluate coding agents' ability to implement features under "vibe coding," a paradigm in which users give high-level natural-language instructions rather than formal code-level specifications or isolated problem-solving tasks, and thus fail to reflect real-world feature development. Method: FeatBench is the first benchmark explicitly designed to assess feature-implementation ability, featuring purely natural-language task inputs, diverse application scenarios, and rigorous testing protocols. Contribution/Results: FeatBench provides a high-quality, code-free dataset built via multi-stage filtering and a fully automated pipeline that lets the benchmark evolve to mitigate data contamination; each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. Evaluation shows that state-of-the-art agents reach a highest success rate of only 29.94%, exposing the double-edged nature of aggressive implementation strategies. All tools and data are publicly released.

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as "vibe coding," where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent's vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating coding agents on feature implementation for vibe coding
Assessing natural language programming without code specifications
Measuring feature implementation success in diverse application domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure natural language prompts for task inputs
Automated pipeline for evolving benchmark data
Comprehensive test cases for correctness verification
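The F2P/P2P verification idea above can be sketched as follows. This is a minimal illustrative sketch, not FeatBench's actual harness: a candidate patch is accepted only if every Fail-to-Pass test now passes (the feature works) and every Pass-to-Pass test still passes (no regression). The `run_test` helper and test names are hypothetical stand-ins for executing a repository's test suite.

```python
def run_test(test_id: str, patched: bool) -> bool:
    """Stand-in for running one repository test; returns True on pass.

    Toy behavior: F2P tests pass only after the patch is applied,
    while P2P tests pass regardless (they guard against regressions).
    """
    if test_id.startswith("f2p"):
        return patched
    return True

def evaluate(f2p_tests: list[str], p2p_tests: list[str]) -> bool:
    """A task is resolved only if all F2P tests flip to passing
    and all P2P tests remain passing under the patched code."""
    feature_ok = all(run_test(t, patched=True) for t in f2p_tests)
    no_regression = all(run_test(t, patched=True) for t in p2p_tests)
    return feature_ok and no_regression

resolved = evaluate(["f2p_new_feature"], ["p2p_existing_api"])
print(resolved)  # True in this toy setup
```

In a real harness, `run_test` would invoke the repository's test runner before and after applying the agent's patch; the two-sided check is what distinguishes "implemented the feature" from "implemented it without breaking anything else."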
Haorui Chen
University of Electronic Science and Technology of China
Chengze Li
College of AI, Tsinghua University; Nanjing University
Jia Li
College of AI, Tsinghua University