On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of LLM-driven automated code quality assessment and improvement. We propose CodeQUEST, a framework featuring a closed-loop architecture with integrated evaluation and optimization modules. It enables fine-grained, multi-dimensional, and verifiable iterative optimization of code by an LLM (GPT-4o) across ten dimensions, including readability, maintainability, efficiency, and security. To ensure objectivity and reproducibility, we introduce a proxy-metric calibration mechanism that integrates established static analyzers (Pylint, Radon, Bandit) with multi-dimensional quantitative scoring. Experiments on Python and JavaScript benchmarks demonstrate an average relative quality improvement of 52.6%. Moreover, CodeQUEST's assessments exhibit strong correlation (ρ > 0.89) with mainstream static analysis tools. The framework establishes a scalable, empirically grounded paradigm for integrating LLMs into software engineering practice.

📝 Abstract
This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework has two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST achieved significant improvements in code quality, with a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising the Pylint Score, the Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs to automate code quality evaluation and improvement, marking a significant step toward better software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.
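The Evaluator/Optimizer loop described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `call_llm` is stubbed in place of a real GPT-4o call, and the function names, the dimension subset, and the 1-5 scoring scale are all assumptions.

```python
# Minimal sketch of a CodeQUEST-style evaluate-then-optimize loop.
# All names (call_llm, evaluate, optimize, DIMENSIONS) and the 1-5 scale
# are illustrative assumptions, not the paper's actual implementation.

DIMENSIONS = ["readability", "maintainability", "efficiency", "security"]
# (the paper scores ten dimensions in total)

def call_llm(prompt: str) -> str:
    """Stand-in for a GPT-4o API call, stubbed so the sketch runs:
    returns a fixed score for rating prompts and echoes the code otherwise."""
    if prompt.startswith("Rate"):
        return "4"
    return prompt.split("\n", 1)[1]

def evaluate(code: str) -> dict:
    """Evaluator: ask the LLM for a quantitative score per dimension."""
    return {
        dim: int(call_llm(f"Rate this code's {dim} from 1 to 5:\n{code}"))
        for dim in DIMENSIONS
    }

def optimize(code: str, max_iters: int = 3, target: int = 4) -> str:
    """Optimizer: rewrite the code guided by the Evaluator's feedback,
    stopping once every dimension reaches the target score."""
    for _ in range(max_iters):
        scores = evaluate(code)
        if all(s >= target for s in scores.values()):
            break
        weakest = min(scores, key=scores.get)
        code = call_llm(f"Improve this code's {weakest}:\n{code}")
    return code
```

In the real framework the stubbed call would go to GPT-4o, and the loop's stopping criterion and prompt design follow the paper's closed-loop architecture.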
Problem

Research questions and friction points this paper is trying to address.

Automates code quality evaluation and enhancement
Leverages GPT-4o for iterative code improvement
Validates improvements using established quality metrics
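The validation against established metrics is reported as a rank correlation (ρ > 0.89 in the summary above). A simplified Spearman rank correlation, ignoring tied ranks, can be computed as follows; the function name and score lists are illustrative, not from the paper:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists.
    Simplified: assumes no tied values (the standard formula with ties differs)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Perfectly monotone scores give rho = 1.0; reversed order gives -1.0
print(spearman_rho([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # prints 1.0
```

In practice one would correlate CodeQUEST's per-example scores with the proxy metrics (Pylint Score, Radon Maintainability Index) across the benchmark.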
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages GPT-4o for code evaluation
Iterative optimization of code quality
Validated with established code metrics
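The 52.6% headline figure is a mean relative percentage improvement of post-optimization quality scores over the originals. A minimal illustration of that computation (the function name and score values are made up; the paper's exact aggregation may differ):

```python
def mean_relative_improvement(before, after):
    """Average of per-example relative percentage gains (after vs. before).
    Assumes non-zero baseline scores; illustrative, not the paper's code."""
    gains = [100.0 * (a - b) / b for b, a in zip(before, after)]
    return sum(gains) / len(gains)

# Illustrative (made-up) aggregate quality scores per code example
before = [2.0, 3.0, 2.5, 4.0]
after = [3.5, 4.0, 4.0, 4.5]
print(round(mean_relative_improvement(before, after), 2))  # prints 45.21
```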
Authors

Rundong Liu (JPMorgan Chase)
Andre Frade (JPMorgan Chase)
Amal Vaidya (JPMorgan Chase)
Maxime Labonne (Head of Post-Training, Liquid AI)
Topics: Large Language Models, Graph Neural Networks, Machine Learning, Cyber Security
Marcus Kaiser (JPMorgan Chase)
Bismayan Chakrabarti (JPMorgan Chase)
Jonathan Budd (JPMorgan Chase)
Sean Moran (TWG Global AI)
Topics: Generative AI, Large Language Models, Computer Vision, Information Retrieval