Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics

📅 2025-11-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of systematic evaluation of non-functional quality (specifically security, maintainability, and performance efficiency) in code generated by large language models (LLMs). Grounded in the ISO/IEC 25010 standard, it combines a systematic literature review, two industry workshops, and empirical experiments with multiple models (GPT-4, Claude, CodeLlama) to analyze the quality of patches generated for real-world software issues along several dimensions. It introduces a non-functional quality assessment framework that reconciles academic rigor with industrial relevance, uncovers significant trade-offs among the three quality attributes, and exposes gaps between LLM outputs and actual engineering requirements, including the accumulation of technical debt. The results show that functional correctness does not imply high non-functional quality, and that outcomes on the non-functional dimensions vary markedly with the choice of model and optimization strategy. The work provides both theoretical foundations and actionable guidelines for designing robust quality assurance mechanisms for LLM-generated code.

📝 Abstract

In recent years, LLMs have been widely integrated into software engineering workflows, supporting tasks like code generation. However, while these models often generate functionally correct outputs, we still lack a systematic understanding and evaluation of their non-functional qualities. Existing studies focus mainly on whether generated code passes the tests rather than whether it passes with quality. Guided by the ISO/IEC 25010 quality model, this study conducted three complementary investigations: a systematic review of 108 papers, two industry workshops with practitioners from multiple organizations, and an empirical analysis of patching real-world software issues using three LLMs. Motivated by insights from both the literature and practitioners, the empirical study examined the quality of generated patches on security, maintainability, and performance efficiency. Across the literature, we found that security and performance efficiency dominate academic attention, while maintainability and other qualities are understudied. In contrast, industry experts prioritize maintainability and readability, warning that generated code may accelerate the accumulation of technical debt. In our evaluation of functionally correct patches generated by three LLMs, improvements in one quality dimension often come at the cost of others. Runtime and memory results further show high variance across models and optimization strategies. Overall, our findings reveal a mismatch between academic focus, industry priorities, and model performance, highlighting the urgent need to integrate quality assurance mechanisms into LLM code generation pipelines to ensure that future generated code not only passes tests but truly passes with quality.
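The distinction between passing the tests and passing with quality can be illustrated with a minimal sketch. The two functions below are hypothetical stand-ins for two functionally correct patches; they are not taken from the study, and the measurement helper `profile` is an assumption about how runtime and peak memory might be recorded, using only Python's standard `time` and `tracemalloc` modules.

```python
import time
import tracemalloc
from typing import Callable, Dict, List


def dedupe_quadratic(items: List[int]) -> List[int]:
    """Hypothetical patch A: order-preserving dedupe via list membership (quadratic time)."""
    seen: List[int] = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen


def dedupe_with_set(items: List[int]) -> List[int]:
    """Hypothetical patch B: same behavior, linear time, extra memory for the set."""
    seen = set()
    out: List[int] = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def profile(func: Callable[[List[int]], List[int]], data: List[int]) -> Dict[str, float]:
    """Record wall-clock time and peak traced memory for one call.

    Note: tracemalloc adds overhead, so the timing is only indicative.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": float(peak), "n_out": float(len(result))}


if __name__ == "__main__":
    data = list(range(2000)) * 3
    # Both patches pass the same functional check...
    assert dedupe_quadratic(data) == dedupe_with_set(data)
    # ...but their non-functional profiles can differ.
    for patch in (dedupe_quadratic, dedupe_with_set):
        print(patch.__name__, profile(patch, data))
```

Both implementations satisfy the same functional test, so a pass/fail benchmark would not distinguish them; only the additional runtime and memory measurements do.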

Problem

Research questions and friction points this paper is trying to address.

Evaluating non-functional quality of LLM-generated code beyond functional correctness

Investigating mismatches between academic focus and industry priorities on code quality

Addressing quality trade-offs in generated patches across security, maintainability, and performance efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Used ISO/IEC 25010 quality model for evaluation

Conducted systematic review and industry workshops

Empirically analyzed patches on multiple quality dimensions
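As a rough illustration of what analyzing patches on multiple quality dimensions might involve, the sketch below flags patches where one dimension improves while another regresses. The `QualityDelta` record, its sign convention (positive means improvement over a baseline), and the sample values are all hypothetical; the paper's actual metrics and tooling are not described here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QualityDelta:
    """Hypothetical per-patch change versus a baseline; positive means better."""
    patch_id: str
    security: float
    maintainability: float
    performance: float


def has_trade_off(delta: QualityDelta) -> bool:
    """True when at least one dimension improves and at least one regresses."""
    values = (delta.security, delta.maintainability, delta.performance)
    return any(v > 0 for v in values) and any(v < 0 for v in values)


def trade_off_rate(deltas: Sequence[QualityDelta]) -> float:
    """Fraction of patches that exhibit a trade-off (0.0 for an empty input)."""
    if not deltas:
        return 0.0
    return sum(has_trade_off(d) for d in deltas) / len(deltas)


if __name__ == "__main__":
    # Illustrative values only, not results from the study.
    sample = [
        QualityDelta("p1", security=1.0, maintainability=-2.0, performance=0.0),
        QualityDelta("p2", security=0.5, maintainability=0.5, performance=0.1),
        QualityDelta("p3", security=0.0, maintainability=1.0, performance=-0.3),
        QualityDelta("p4", security=-1.0, maintainability=-1.0, performance=0.0),
    ]
    print(f"trade-off rate: {trade_off_rate(sample):.2f}")
```

Treating each dimension as a signed delta keeps the check independent of the underlying metric, at the cost of ignoring how large each change is.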

🔎 Similar Papers

Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models