🤖 AI Summary
This work investigates the capability of large language models (LLMs) to generate **functionally correct class-level code** in real-world software projects, moving beyond prior studies focused on function-level generation or synthetic benchmarks. The authors introduce the first **real-world, open-source-derived class-level code generation benchmark**, along with a dedicated evaluation framework that distinguishes "seen" from "unseen" codebases to assess generalization. They conduct systematic experiments across multiple LLMs, incorporating retrieval-augmented generation (RAG), ablation studies on input-specification quality, and fine-grained error-mode analysis. Results show that current LLMs achieve only 25–34% functional correctness on real-world class generation, substantially lower than their 84–89% performance on synthetic benchmarks. RAG yields modest gains (4–7%), most notably when documentation is only partial; models otherwise exhibit low sensitivity to documentation quality. The predominant failure modes are AttributeError, TypeError, and AssertionError.
📝 Abstract
Large language models (LLMs) have advanced code generation at the function level, yet their ability to produce correct class-level implementations in authentic software projects remains poorly understood. This work introduces a novel benchmark derived from open-source repositories, comprising real-world classes divided into seen and unseen partitions to evaluate generalization under practical conditions. The evaluation examines multiple LLMs under varied input specifications, retrieval-augmented configurations, and documentation completeness levels.
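As a rough illustration of the seen/unseen evaluation protocol described above, functional correctness per partition is simply the fraction of benchmark classes whose reference tests pass. The sketch below uses hypothetical names (`Task`, `correctness_by_partition`); it is not the benchmark's actual harness, only a minimal model of the metric:

```python
from dataclasses import dataclass

@dataclass
class Task:
    partition: str   # "seen" or "unseen" codebase
    passed: bool     # did the generated class pass its reference tests?

def correctness_by_partition(tasks):
    """Fraction of functionally correct generations per partition."""
    totals, passes = {}, {}
    for t in tasks:
        totals[t.partition] = totals.get(t.partition, 0) + 1
        passes[t.partition] = passes.get(t.partition, 0) + t.passed
    return {p: passes[p] / totals[p] for p in totals}

results = [Task("seen", True), Task("seen", False),
           Task("unseen", False), Task("unseen", True)]
print(correctness_by_partition(results))  # {'seen': 0.5, 'unseen': 0.5}
```

Comparing the two partition scores is what lets the benchmark report "negligible differences between familiar and novel codebases."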
Results reveal a stark performance disparity: LLMs achieve 84% to 89% correctness on established synthetic benchmarks but only 25% to 34% on real-world class tasks, with negligible differences between familiar and novel codebases. Comprehensive docstrings yield modest gains of 1% to 3% in functional accuracy, though statistical significance is rare. Retrieval-augmented generation proves most effective with partial documentation, improving correctness by 4% to 7% by supplying concrete implementation patterns absent from specifications. Error profiling identifies AttributeError, TypeError, and AssertionError as dominant failure modes (84% of cases), with synthetic tests overemphasizing assertion issues and real-world scenarios highlighting type and attribute mismatches. Retrieval augmentation reduces logical flaws but can introduce dependency conflicts.
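The error profiling above buckets failures by runtime exception type. A minimal sketch of how such bucketing might work, assuming each benchmark task ships as a generated class definition plus a reference test snippet (the helper name `classify_failure` and the `Rectangle` example are hypothetical, not the paper's harness):

```python
def classify_failure(candidate_source: str, test_source: str) -> str:
    """Run a generated class against its reference tests and return the
    name of the first raised exception ("pass" if the tests succeed)."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)   # load the generated class
        exec(test_source, namespace)        # execute the reference tests
    except Exception as exc:                # bucket by exception class
        return type(exc).__name__
    return "pass"

# A generated class that omits the required `area` method:
candidate = """
class Rectangle:
    def __init__(self, w, h):
        self.w, self.h = w, h
"""
tests = "r = Rectangle(2, 3)\nassert r.area() == 6"
print(classify_failure(candidate, tests))  # → AttributeError
```

Tallying these labels over all tasks yields the kind of distribution reported here, where AttributeError, TypeError, and AssertionError together cover 84% of failures.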
The benchmark and analysis expose critical limitations in current LLM capabilities for class-level engineering, offering actionable insights for enhancing context modelling, documentation strategies, and retrieval integration in production code assistance tools.