Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the capability of large language models (LLMs) to generate **functionally correct class-level code** in real-world software projects—moving beyond prior studies focused on function-level generation or synthetic benchmarks. The authors introduce the first **real-world, open-source-derived class-level code generation benchmark**, along with a dedicated evaluation framework that distinguishes “seen” versus “unseen” codebases to assess generalization. They conduct systematic experiments across multiple LLMs, incorporating retrieval-augmented generation (RAG), ablation studies on input specification quality, and fine-grained error mode analysis. Results show that current LLMs achieve only 25–34% functional correctness on real class generation—substantially lower than their 84–89% performance on synthetic benchmarks. RAG yields modest gains (4–7%) only when documentation is comprehensive; models exhibit low sensitivity to documentation quality overall. The predominant failure modes are AttributeError, TypeError, and AssertionError.
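The fine-grained error mode analysis boils down to running each generated class against its test suite and tallying which exception type each failure raises. A minimal sketch of that tallying step is below; the function names and the test-callable shape are hypothetical, not taken from the paper's framework.

```python
def classify_failure(exc: BaseException) -> str:
    """Map an exception from a failed test run to one of the coarse
    error modes the paper reports as dominant (hypothetical helper)."""
    if isinstance(exc, AttributeError):
        return "AttributeError"
    if isinstance(exc, TypeError):
        return "TypeError"
    if isinstance(exc, AssertionError):
        return "AssertionError"
    return "Other"

def run_tests(generated_class, tests):
    """Execute each test callable against the generated class and
    count failures per error mode."""
    tally = {}
    for test in tests:
        try:
            test(generated_class)
        except Exception as exc:
            mode = classify_failure(exc)
            tally[mode] = tally.get(mode, 0) + 1
    return tally
```

For example, a generated `Stack` class that omits a `push` method would contribute an `AttributeError` to the tally, while a test asserting the presence of `pop` would register an `AssertionError`.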

📝 Abstract
Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels. Results reveal a stark performance disparity: LLMs achieve 84% to 89% correctness on established synthetic benchmarks but only 25% to 34% on real-world class tasks, with negligible differences between familiar and novel codebases. Comprehensive docstrings yield modest gains of 1% to 3% in functional accuracy, though statistical significance is rare. Retrieval-augmented generation proves most effective with partial documentation, improving correctness by 4% to 7% by supplying concrete implementation patterns absent from specifications. Error profiling identifies AttributeError, TypeError, and AssertionError as dominant failure modes (84% of cases), with synthetic tests overemphasizing assertion issues and real-world scenarios highlighting type and attribute mismatches. Retrieval augmentation reduces logical flaws but can introduce dependency conflicts. The benchmark and analysis expose critical limitations in current LLM capabilities for class-level engineering, offering actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance on real-world class-level code generation
Assessing generalization using seen and unseen repository partitions
Identifying dominant error types and retrieval augmentation impacts
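The seen/unseen split described above can be approximated by comparing each repository's age against the model's training cutoff; this is a common proxy, and the paper's exact partitioning criterion may differ. A minimal sketch, with a hypothetical cutoff date:

```python
from datetime import date

TRAINING_CUTOFF = date(2023, 4, 30)  # hypothetical model training cutoff

def partition_repos(repos):
    """Split repositories into 'seen' (plausibly in training data) and
    'unseen' partitions by first-commit date relative to the cutoff."""
    seen, unseen = [], []
    for repo in repos:
        bucket = seen if repo["created"] <= TRAINING_CUTOFF else unseen
        bucket.append(repo["name"])
    return seen, unseen
```

The paper's finding that correctness differs negligibly between the two partitions suggests memorization of seen codebases is not what limits class-level generation.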
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark from open-source repositories
Evaluates LLMs with varied input specifications
Applies retrieval-augmented generation to supply implementation context
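The retrieval-augmented setup amounts to fetching implementation patterns related to the class specification and prepending them to the generation prompt. The sketch below uses lexical similarity (`difflib`) as a stand-in for the embedding-based retriever a real RAG pipeline would use; the function names and prompt template are assumptions, not the paper's implementation.

```python
from difflib import SequenceMatcher

def retrieve(spec: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus snippets by lexical similarity to the class spec
    (a stand-in for embedding-based retrieval)."""
    return sorted(
        corpus,
        key=lambda snippet: SequenceMatcher(None, spec, snippet).ratio(),
        reverse=True,
    )[:k]

def build_prompt(spec: str, corpus: list[str]) -> str:
    """Prepend retrieved implementation patterns to the generation prompt."""
    context = "\n\n".join(retrieve(spec, corpus))
    return f"# Reference snippets:\n{context}\n\n# Implement the class:\n{spec}"
```

This illustrates why the paper finds RAG most helpful when documentation is only partial: the retrieved snippets contribute concrete patterns that the specification itself omits.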