🤖 AI Summary
This work evaluates the effectiveness of large language models (LLMs) in code migration: adapting source code to run across different runtime environments. To this end, we introduce CODEMENV, the first dedicated benchmark for code migration, comprising 922 examples spanning 19 Python and Java packages. It systematically assesses three core capabilities: identifying version-incompatible functions, detecting changes in function definitions, and adapting code to target environments. The evaluation paradigm generates test cases via multi-version repository analysis, followed by human validation and automated execution. Experimental results across seven state-of-the-art LLMs show an average pass@1 of only 26.50%, with GPT-4o achieving the highest score (43.84%). Models exhibit systematic biases, including a "new-version preference" and environment-agnostic misjudgments, revealing critical deficits in logical consistency and contextual reasoning. This work establishes a novel, empirically grounded standard for evaluating LLMs' code evolution capabilities.
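To make the environment-aware adaptation task concrete, here is a minimal illustrative sketch (the `migrate` helper is hypothetical and not part of CODEMENV). It adapts a real standard-library change: `collections.Mapping` was removed in Python 3.10, where the replacement is `collections.abc.Mapping`.

```python
def migrate(source: str, target_version: tuple) -> str:
    """Rewrite a version-incompatible identifier for the target runtime.

    On Python >= 3.10, collections.Mapping no longer exists, so the
    call site must be rewritten to collections.abc.Mapping; older
    targets keep the legacy spelling.
    """
    if target_version >= (3, 10):
        return source.replace("collections.Mapping", "collections.abc.Mapping")
    return source

legacy = "isinstance(cfg, collections.Mapping)"
print(migrate(legacy, (3, 11)))  # rewritten for a Python 3.11 environment
print(migrate(legacy, (3, 8)))   # unchanged for a Python 3.8 environment
```

A real migration system would of course parse the code rather than rewrite strings; the sketch only shows the environment-conditioned decision the benchmark tests.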
📝 Abstract
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, i.e., adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4o achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
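For readers unfamiliar with the reported metric, pass@1 is commonly computed with the unbiased pass@k estimator introduced with the HumanEval benchmark; a minimal sketch, assuming `n` samples per task of which `c` pass (the function name and arguments are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the fraction of correct samples:
print(pass_at_k(10, 4, 1))
```

The benchmark-level score is then the mean of this quantity over all tasks.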