CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code evaluation methodologies suffer from three key limitations: (1) insufficiently rigorous evaluation protocols, (2) biased benchmark datasets, and (3) suboptimal metric design, particularly CodeBLEU, which underperforms in tokenization precision, structural modeling, and reference diversity. To address these issues, the authors propose CFCEval, a comprehensive evaluation framework comprising MLVBench, a benchmark constructed via deduplication and version control to mitigate train/test overlap, and ELRM, a novel metric integrating abstract syntax tree (AST) analysis with semantic similarity modeling. CFCEval systematically assesses code LLMs across four dimensions: programming quality, vulnerability-fixing capability, post-transformation fixing capability, and relevance, enabling a unified multidimensional evaluation of both security and quality. Experimental results demonstrate that ELRM correlates more closely with human judgments than CodeBLEU, and that CFCEval captures both the quality and security of generated code more effectively, establishing a more reliable, robust, and holistic evaluation standard for code generation by large language models.
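The paper does not spell out ELRM's exact formulation here, but the core idea of pairing AST-level structure with surface-level similarity can be illustrated with a toy sketch. The function names, the equal 0.5 weighting, and the use of `difflib` below are illustrative assumptions, not the authors' method:

```python
import ast
import difflib


def ast_node_types(code: str) -> list[str]:
    """Flatten a program's AST into a sequence of node-type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(code))]


def structural_similarity(ref: str, gen: str) -> float:
    """Compare the AST node-type sequences (structural dimension)."""
    return difflib.SequenceMatcher(
        None, ast_node_types(ref), ast_node_types(gen)
    ).ratio()


def token_similarity(ref: str, gen: str) -> float:
    """Compare whitespace-split token streams (surface dimension)."""
    return difflib.SequenceMatcher(None, ref.split(), gen.split()).ratio()


def combined_score(ref: str, gen: str, alpha: float = 0.5) -> float:
    """Hypothetical blend of structural and token similarity."""
    return alpha * structural_similarity(ref, gen) + (1 - alpha) * token_similarity(ref, gen)


# Identical code scores 1.0; structurally and lexically different code scores lower.
print(combined_score("x = 1 + 2", "x = 1 + 2"))
print(combined_score("x = 1 + 2", "def f(a, b):\n    return a * b"))
```

A real metric like ELRM would replace the token channel with learned semantic embeddings; the sketch only shows why combining the two channels penalizes code that matches on tokens but diverges in structure (or vice versa).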

📝 Abstract
Code-focused Large Language Models (LLMs), such as Codex and StarCoder, have demonstrated remarkable capabilities in enhancing developer productivity through context-aware code generation. However, evaluating the quality and security of LLM-generated code remains a significant challenge. Existing evaluation protocols for Code LLMs lack both methodological rigor and comprehensive scope. A key limitation is dataset bias, which arises from unintentional overlap between training and testing data. Furthermore, while CodeBLEU, a BLEU-based metric, is widely used to assess code similarity, it suffers from critical shortcomings, including imprecise tokenization, structural limitations, and low reference diversity. To address these challenges, we introduce CFCEval, a novel framework for evaluating the quality and security of code generated by LLMs. CFCEval mitigates dataset bias by creating a new benchmark, MLVBench, and incorporates ELRM, a new metric designed to assess the relevance between reference code and generated code. CFCEval evaluates generated code across four dimensions: programming quality, vulnerability-fixing capability, post-transformation fixing capability, and relevance. Our experiments show that CFCEval not only captures both quality and security aspects of generated code more effectively but also that its ELRM aligns more closely with human judgments than CodeBLEU, thus paving the way for future advancements in Code LLMs evaluation.
Problem

Research questions and friction points this paper is trying to address.

How to rigorously evaluate both the security and quality of LLM-generated code
Dataset bias from train/test overlap and the limitations of metrics such as CodeBLEU
Assessing generated code across programming-quality, vulnerability-fixing, and relevance dimensions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CFCEval framework for evaluating LLM-generated code
Mitigates dataset bias with new benchmark MLVBench
Proposes ELRM, a metric that aligns more closely with human judgment than CodeBLEU
Cheng Cheng
Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Jinqiu Yang
Concordia University
Automated Program Repair, Text Analytics of Software Artifacts, Mining Software Repositories, Software Engineering