On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of LLM-driven automated code quality assessment and improvement. We propose CodeQUEST, a framework featuring a closed-loop architecture with integrated evaluation and optimization modules. It enables fine-grained, multi-dimensional, and verifiable iterative optimization of code by an LLM (GPT-4o) across ten dimensions, including readability, maintainability, efficiency, and security. To ensure objectivity and reproducibility, we introduce a proxy-metric calibration mechanism that integrates established static analyzers (Pylint, Radon, Bandit) with multi-dimensional quantitative scoring. Experiments on Python and JavaScript benchmarks demonstrate an average relative quality improvement of 52.6%. Moreover, CodeQUEST's assessments exhibit strong correlation (ρ > 0.89) with mainstream static analysis tools. The framework establishes a scalable, empirically grounded paradigm for integrating LLMs into software engineering practice.

📝 Abstract
This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework has two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST achieved significant improvements in code quality, with a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising the Pylint Score, the Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs to automate code quality evaluation and improvement, marking a significant step toward better software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.
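The Evaluator/Optimizer loop described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `call_llm` is stubbed in place of a real GPT-4o call, and the function names, the dimension subset, and the 1-5 scoring scale are all assumptions.

```python
# Minimal sketch of a CodeQUEST-style evaluate-then-optimize loop.
# All names (call_llm, evaluate, optimize, DIMENSIONS) and the 1-5 scale
# are illustrative assumptions, not the paper's actual implementation.

DIMENSIONS = ["readability", "maintainability", "efficiency", "security"]
# (the paper scores ten dimensions in total)

def call_llm(prompt: str) -> str:
    """Stand-in for a GPT-4o API call, stubbed so the sketch runs:
    returns a fixed score for rating prompts and echoes the code otherwise."""
    if prompt.startswith("Rate"):
        return "4"
    return prompt.split("\n", 1)[1]

def evaluate(code: str) -> dict:
    """Evaluator: ask the LLM for a quantitative score per dimension."""
    return {
        dim: int(call_llm(f"Rate this code's {dim} from 1 to 5:\n{code}"))
        for dim in DIMENSIONS
    }

def optimize(code: str, max_iters: int = 3, target: int = 4) -> str:
    """Optimizer: rewrite the code guided by the Evaluator's feedback,
    stopping once every dimension reaches the target score."""
    for _ in range(max_iters):
        scores = evaluate(code)
        if all(s >= target for s in scores.values()):
            break
        weakest = min(scores, key=scores.get)
        code = call_llm(f"Improve this code's {weakest}:\n{code}")
    return code
```

In the real framework the stubbed call would go to GPT-4o, and the loop's stopping criterion and prompt design follow the paper's closed-loop architecture.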
Problem

Research questions and friction points this paper is trying to address.

Automates code quality evaluation and enhancement
Leverages GPT-4o for iterative code improvement
Validates improvements using established quality metrics
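The validation against established metrics is reported as a rank correlation (ρ > 0.89 in the summary above). A simplified Spearman rank correlation, ignoring tied ranks, can be computed as follows; the function name and score lists are illustrative, not from the paper:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists.
    Simplified: assumes no tied values (the standard formula with ties differs)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perfectly monotone scores give rho = 1.0; reversed order gives -1.0
print(spearman_rho([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # prints 1.0
```

In practice one would correlate CodeQUEST's per-example scores with the proxy metrics (Pylint Score, Radon Maintainability Index) across the benchmark.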
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages GPT-4o for code evaluation
Iterative optimization of code quality
Validated with established code metrics
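The 52.6% headline figure is a mean relative percentage improvement of post-optimization quality scores over the originals. A minimal illustration of that computation (the function name and score values are made up; the paper's exact aggregation may differ):

```python
def mean_relative_improvement(before, after):
    """Average of per-example relative percentage gains (after vs. before).
    Assumes non-zero baseline scores; illustrative, not the paper's code."""
    gains = [100.0 * (a - b) / b for b, a in zip(before, after)]
    return sum(gains) / len(gains)

# Illustrative (made-up) aggregate quality scores per code example
before = [2.0, 3.0, 2.5, 4.0]
after = [3.5, 4.0, 4.0, 4.5]
print(round(mean_relative_improvement(before, after), 2))  # prints 45.21
```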
Authors

Rundong Liu (JPMorgan Chase)
Andre Frade (JPMorgan Chase)
Amal Vaidya (JPMorgan Chase)
Maxime Labonne (Head of Post-Training, Liquid AI)
Topics: Large Language Models, Graph Neural Networks, Machine Learning, Cyber Security
Marcus Kaiser (JPMorgan Chase)
Bismayan Chakrabarti (JPMorgan Chase)
Jonathan Budd (JPMorgan Chase)
Sean Moran (TWG Global AI)
Topics: Generative AI, Large Language Models, Computer Vision, Information Retrieval