Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation

📅 2025-03-14

📈 Citations: 0

✨ Influential: 0

career value

179K/year

🤖 AI Summary

This study investigates the causal mechanism by which training data quality affects the performance of AI code generators. We conduct controlled fine-tuning experiments on pre-trained models (e.g., CodeT5, CodeGen), comparing supervision on raw, defect-prone Python function data versus high-quality data curated via human annotation and rule-based cleaning. Generated code is rigorously evaluated across multiple dimensions: security, maintainability, and adherence to best practices. Our work provides the first empirical evidence establishing a direct causal link between low-quality training data and increased defect rates in generated code. Crucially, merely cleaning the training dataset reduces the functional defect rate from 5.85% to 2.16%—a statistically significant improvement—while preserving functional correctness. These findings underscore data quality as a critical determinant of code generation reliability and establish a reproducible, data-centric governance framework for developing trustworthy AI programming assistants.

Technology Category

Application Category

📝 Abstract

Deep Learning-based code generators have seen significant advancements in recent years. Tools such as GitHub Copilot are used by thousands of developers with the main promise of a boost in productivity. However, researchers have recently questioned their impact on code quality showing, for example, that code generated by DL-based tools may be affected by security vulnerabilities. Since DL models are trained on large code corpora, one may conjecture that low-quality code they output is the result of low-quality code they have seen during training. However, there is very little empirical evidence documenting this phenomenon. Indeed, most of previous work look at the frequency with which commercial code generators recommend low-quality code without the possibility of relating this to their training set. We investigate the extent to which low-quality code instances seen during training affect the quality of the code generated at inference time. We start by fine-tuning a pre-trained DL model on a large-scale dataset being representative of those usually adopted in the training of code generators. We show that 4.98% of functions in this dataset exhibit one or more quality issues related to security, maintainability, best practices, etc. We use the fine-tuned model to generate 551k Python functions, showing that 5.85% of them are affected by at least one quality issue. We then remove from the training set the low-quality functions, and use the cleaned dataset to fine-tune a second model which has been used to generate the same 551k Python functions. We show that the model trained on the cleaned dataset exhibits similar performance in terms of functional correctness as compared to the original model while, however, generating a statistically significant lower number of low-quality functions (2.16%). Our study empirically documents the importance of high-quality training data for code generators.

Problem

Research questions and friction points this paper is trying to address.

Investigates impact of low-quality training data on AI-generated code quality.

Examines relationship between training data quality and security vulnerabilities in code.

Demonstrates improved code quality by using cleaned, high-quality training datasets.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned DL model on large-scale dataset

Removed low-quality functions from training set

Generated lower low-quality code with cleaned dataset

🔎 Similar Papers

Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models