CODEMENV: Benchmarking Large Language Models on Code Migration

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates the effectiveness of large language models (LLMs) in code migration, i.e., adapting source code to run across different environments. To this end, the authors introduce CODEMENV, the first dedicated benchmark for code migration, comprising 922 instances spanning 19 Python and Java packages. It systematically assesses three core capabilities: identifying version-incompatible functions, detecting changes in function definitions, and adapting code to target environments. Test cases are generated through multi-version repository analysis, then human-validated and executed automatically. Experiments across seven state-of-the-art LLMs show an average pass@1 of only 26.50%, with GPT-4o achieving the highest score (43.84%). Models exhibit systematic biases, including a "new-version preference" and environment-agnostic misjudgments, revealing critical deficits in logical consistency and contextual reasoning. The work establishes an empirically grounded standard for evaluating LLMs' code-evolution capabilities.
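To make the task concrete, here is a hypothetical migration instance in the spirit of CODEMENV (illustrative only, not drawn from the benchmark data): the NumPy alias np.float was removed in NumPy 1.24, so environment-aware adaptation must rewrite code written for the older environment to run in the newer one.

    # Hypothetical instance (illustrative; not from the benchmark).
    import numpy as np

    # Source environment: numpy < 1.24, where the np.float alias still exists.
    def normalize_legacy(xs):
        arr = np.array(xs, dtype=np.float)  # AttributeError under numpy >= 1.24
        return arr / arr.sum()

    # Target environment: numpy >= 1.24. The migrated code replaces the
    # removed alias with the builtin float (np.float64 would also work).
    def normalize_migrated(xs):
        arr = np.array(xs, dtype=float)
        return arr / arr.sum()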

📝 Abstract
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4o achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
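The pass@1 numbers above are execution-based: a task counts as solved only if the generated code runs correctly. Below is a minimal sketch of the conventional unbiased pass@k estimator (Chen et al., 2021), assuming n samples per task of which c pass; the paper does not detail its sampling setup, so this is the standard formulation rather than its exact harness.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator: 1 - C(n-c, k) / C(n, k); reduces to c/n for k=1.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(8, 2, 1))  # 8 generations, 2 correct -> 0.25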
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' effectiveness in code migration tasks
Evaluating LLMs on identifying incompatible functions and detecting definition changes (see the sketch after this list)
Measuring LLMs' ability to adapt code to target environments
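As a concrete illustration of the second question, a signature diff between two versions of the same function is the kind of change a model must detect. A minimal sketch using inspect, with hypothetical resize_v1/resize_v2 functions standing in for two package versions:

    import inspect

    def signature_changed(old_fn, new_fn):
        # Compare parameter names between two versions of a callable.
        old, new = inspect.signature(old_fn), inspect.signature(new_fn)
        return {
            "removed": sorted(set(old.parameters) - set(new.parameters)),
            "added": sorted(set(new.parameters) - set(old.parameters)),
        }

    # Hypothetical API change across two package versions:
    def resize_v1(img, size, interp="bilinear"): ...
    def resize_v2(img, size, mode="bilinear", antialias=True): ...

    print(signature_changed(resize_v1, resize_v2))
    # {'removed': ['interp'], 'added': ['antialias', 'mode']}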
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CODEMENV benchmark for code migration
Evaluates seven LLMs on 922 examples spanning 19 Python and Java packages
Tests function version compatibility and environment-aware adaptation (see the execution sketch after this list)
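Execution-based checking of adapted code is typically done by running it under the target environment's pinned interpreter. A minimal sketch, assuming one virtual environment per target configuration (the paper does not describe its harness at this level of detail):

    import subprocess

    def passes_in_env(python_bin, test_file, timeout=60):
        # python_bin: interpreter of a venv with the target package versions
        # pinned (hypothetical layout). Pass = test script exits with code 0.
        try:
            result = subprocess.run([python_bin, test_file],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    # e.g. passes_in_env(".envs/numpy124/bin/python", "task_042_test.py")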
Keyuan Cheng
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Xudong Shen
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Yihao Yang
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Tengyue Wang
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Yang Cao
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Muhammad Asif Ali
King Abdullah University of Science and Technology
NLP · Deep Learning · Machine Learning
Hanbin Wang
Peking University
Natural Language Processing · Code Intelligence · Information Retrieval
Lijie Hu
Assistant Professor, MBZUAI
Explainable AI · LLM · Differential Privacy
Di Wang
Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology