AI Summary
Existing LLM-based code evaluation methods suffer from narrow domain knowledge coverage and insufficient understanding of complex code semantics. This paper proposes the first LLM-agent-based framework for assessing code generation, built around a two-stage "multi-source knowledge analysis → deliberative scoring" mechanism. It combines dynamic retrieval of heterogeneous domain knowledge via multi-agent collaboration, stepwise evaluation planning, and multi-role debate to enable deep functional and semantic analysis of code. It also introduces code-semantic-aware prompting and multi-metric consistency calibration to overcome the limitations of single-round scoring. Experiments show average improvements over the best baseline in correlation with reference scores: +0.202 (Pearson), +0.139 (Spearman), and +0.117 (Kendall-Tau). The framework also generates interpretable, consensus-driven, fine-grained diagnostic reports, which make the assessment more reliable and more useful as developer feedback.
Abstract
Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories: human-centered, metric-based, and LLM-based. Because human-centered approaches are labour-intensive and metric-based ones rely heavily on reference answers, LLM-based approaches are gaining increasing attention for their stronger contextual understanding and superior efficiency. However, the performance of LLM-based approaches remains limited by (1) a lack of multisource domain knowledge and (2) insufficient comprehension of complex code. To mitigate these limitations, we propose CodeVisionary, the first LLM-based agent framework for evaluating LLMs in code generation. CodeVisionary consists of two stages: (1) a multisource knowledge analysis stage, which gathers comprehensive, multisource domain knowledge by formulating and executing a stepwise evaluation plan; and (2) a negotiation-based scoring stage, in which multiple judges discuss the code to better comprehend its complexity and reach a consensus on the evaluation score. Extensive experiments demonstrate that CodeVisionary achieves the best performance for evaluating LLMs in code generation, outperforming the best baseline methods with average improvements of 0.202, 0.139, and 0.117 in the Pearson, Spearman, and Kendall-Tau coefficients, respectively. In addition, CodeVisionary provides detailed evaluation reports, which help developers identify shortcomings and make improvements. The resources of CodeVisionary are available at https://anonymous.4open.science/r/CodeVisionary.
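The three reported coefficients measure how closely an evaluator's scores track reference scores. The sketch below shows one way they can be computed in plain Python. It is illustrative only and not taken from the CodeVisionary resources: the function names and the sample scores are assumptions, and the tie handling (average ranks for Spearman, the tau-b variant for Kendall) is a common choice that the abstract does not specify.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # pair tied in both lists contributes nothing
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / sqrt(untied_x * untied_y)


# Hypothetical scores for five code samples (invented for illustration).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 2, 5, 4]

print(f"Pearson:     {pearson(human_scores, judge_scores):.3f}")        # 0.800
print(f"Spearman:    {spearman(human_scores, judge_scores):.3f}")       # 0.800
print(f"Kendall-Tau: {kendall_tau_b(human_scores, judge_scores):.3f}")  # 0.600
```

In this toy data the judge swaps two adjacent pairs, so 8 of the 10 sample pairs are ordered consistently with the human scores and 2 are not, giving a Kendall-Tau of (8 - 2) / 10 = 0.6. An improvement such as +0.117 in Kendall-Tau therefore corresponds to a noticeably larger share of sample pairs being ranked in the same order as the reference.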