CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

📅 2025-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing LLM-based code evaluation methods suffer from limited multisource domain knowledge and insufficient comprehension of complex code. This paper proposes CodeVisionary, the first LLM-based agent framework for evaluating LLMs in code generation, built around a two-stage “multi-source knowledge analysis → negotiation-based scoring” mechanism. It integrates dynamic retrieval of heterogeneous domain knowledge via multi-agent collaboration, stepwise evaluation planning, and multi-role debate to enable deep functional and semantic analysis of code. It also introduces code-semantic-aware prompting and multi-metric consistency calibration to overcome limitations of single-round scoring. Experiments show average improvements over the best baseline of 0.202 (Pearson), 0.139 (Spearman), and 0.117 (Kendall-Tau) in correlation coefficients. The framework also produces detailed, consensus-driven evaluation reports that help developers identify shortcomings and make improvements.

📝 Abstract

Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories: human-centered, metric-based, and LLM-based. Because human-centered approaches are labour-intensive and metric-based ones rely heavily on reference answers, LLM-based approaches are gaining increasing attention for their stronger contextual understanding and superior efficiency. However, the performance of LLM-based approaches remains limited by (1) a lack of multisource domain knowledge and (2) insufficient comprehension of complex code. To mitigate these limitations, we propose CodeVisionary, the first LLM-based agent framework for evaluating LLMs in code generation. CodeVisionary consists of two stages: (1) a multisource knowledge analysis stage, which gathers multisource and comprehensive domain knowledge by formulating and executing a stepwise evaluation plan, and (2) a negotiation-based scoring stage, in which multiple judges discuss the code to better comprehend its complexity and reach a consensus on the evaluation score. Extensive experiments demonstrate that CodeVisionary achieves the best performance for evaluating LLMs in code generation, outperforming the best baseline methods with average improvements of 0.202, 0.139, and 0.117 in Pearson, Spearman, and Kendall-Tau coefficients, respectively. In addition, CodeVisionary provides detailed evaluation reports, which assist developers in identifying shortcomings and making improvements. The resources of CodeVisionary are available at https://anonymous.4open.science/r/CodeVisionary.
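
The three coefficients reported in the abstract each measure how closely one list of scores tracks another. The sketch below is a minimal, standard-library illustration of how such coefficients can be computed. It is not the paper's evaluation code: the score lists `reference_scores` and `evaluator_scores` are made-up values, the function names are hypothetical, and the Kendall variant shown is tau-a, which assumes no tied scores.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up scores for illustration only (not data from the paper).
reference_scores = [1, 2, 3, 4, 5]
evaluator_scores = [2, 1, 4, 3, 5]

print(pearson(reference_scores, evaluator_scores))      # approximately 0.8
print(spearman(reference_scores, evaluator_scores))     # approximately 0.8
print(kendall_tau(reference_scores, evaluator_scores))  # 0.6
```

On these toy lists, two of the ten score pairs are ordered differently by the evaluator, which is why Kendall tau-a comes out at (8 - 2) / 10. An improvement such as the reported 0.202 is a difference between two coefficients of this kind.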

Problem

Research questions and friction points this paper is trying to address.

LLM-based evaluation of code generation lacks multisource domain knowledge.

Current methods struggle with complex code comprehension.

Human-centered evaluation is labour-intensive, and metric-based evaluation relies heavily on reference answers.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-based framework for LLM code evaluation

Multisource knowledge analysis for comprehensive assessment

Negotiation-based scoring for consensus on complex code
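
To make the idea of reaching a consensus score concrete, here is a deliberately simplified numeric toy. It is an assumption-laden stand-in, not CodeVisionary's mechanism: in the paper the judges are LLMs that discuss the code, whereas here each "judge" is just a number that moves toward the panel median each round. The function name `negotiate` and the parameters `step`, `tolerance`, and `max_rounds` are all hypothetical.

```python
from statistics import median


def negotiate(initial_scores, step=0.5, tolerance=0.5, max_rounds=10):
    """Toy consensus loop (illustrative assumption, not the paper's method).

    Each round, every judge moves a fraction `step` of the way toward the
    current panel median. The loop stops once the spread between the highest
    and lowest score is at most `tolerance`, or after `max_rounds` rounds.
    Returns the consensus score (final median) and the final per-judge scores.
    """
    scores = list(initial_scores)
    for _ in range(max_rounds):
        if max(scores) - min(scores) <= tolerance:
            break
        target = median(scores)
        scores = [s + step * (target - s) for s in scores]
    return median(scores), scores


# Three hypothetical judges start with different opinions of the same code.
consensus, final_scores = negotiate([2, 4, 5])
print(consensus, final_scores)
```

The point of the sketch is only the control flow: repeated rounds, a convergence check, and a single agreed score at the end. A real negotiation stage would exchange arguments about the code rather than averaging numbers.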

🔎 Similar Papers

A Survey on Evaluating Large Language Models in Code Generation Tasks