CODEMENV: Benchmarking Large Language Models on Code Migration

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates the effectiveness of large language models (LLMs) in code migration, i.e., adapting source code to run across different environments. To this end, the authors introduce CODEMENV, the first dedicated benchmark for code migration, comprising 922 instances spanning 19 Python and Java packages. It systematically assesses three core capabilities: identifying version-incompatible functions, detecting changes in function definitions, and adapting code to target environments. Test cases are generated through multi-version repository analysis, then human-validated and executed automatically. Experiments across seven state-of-the-art LLMs show an average pass@1 of only 26.50%, with GPT-4o achieving the highest score (43.84%). Models exhibit systematic biases, including a "new-version preference" and environment-agnostic misjudgments, revealing critical deficits in logical consistency and contextual reasoning. The work establishes an empirically grounded standard for evaluating LLMs' code-evolution capabilities.
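To make the task concrete, here is a hypothetical migration instance in the spirit of CODEMENV (illustrative only, not drawn from the benchmark data): the NumPy alias np.float was removed in NumPy 1.24, so environment-aware adaptation must rewrite code written for the older environment to run in the newer one.

    # Hypothetical instance (illustrative; not from the benchmark).
    import numpy as np

    # Source environment: numpy < 1.24, where the np.float alias still exists.
    def normalize_legacy(xs):
        arr = np.array(xs, dtype=np.float)  # AttributeError under numpy >= 1.24
        return arr / arr.sum()

    # Target environment: numpy >= 1.24. The migrated code replaces the
    # removed alias with the builtin float (np.float64 would also work).
    def normalize_migrated(xs):
        arr = np.array(xs, dtype=float)
        return arr / arr.sum()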

📝 Abstract
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4o achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
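The pass@1 numbers above are execution-based: a task counts as solved only if the generated code runs correctly. Below is a minimal sketch of the conventional unbiased pass@k estimator (Chen et al., 2021), assuming n samples per task of which c pass; the paper does not detail its sampling setup, so this is the standard formulation rather than its exact harness.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator: 1 - C(n-c, k) / C(n, k); reduces to c/n for k=1.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(8, 2, 1))  # 8 generations, 2 correct -> 0.25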
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' effectiveness in code migration tasks
Evaluating LLMs on identifying incompatible functions and detecting definition changes (see the sketch after this list)
Measuring LLMs' ability to adapt code to target environments
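As a concrete illustration of the second question, a signature diff between two versions of the same function is the kind of change a model must detect. A minimal sketch using inspect, with hypothetical resize_v1/resize_v2 functions standing in for two package versions:

    import inspect

    def signature_changed(old_fn, new_fn):
        # Compare parameter names between two versions of a callable.
        old, new = inspect.signature(old_fn), inspect.signature(new_fn)
        return {
            "removed": sorted(set(old.parameters) - set(new.parameters)),
            "added": sorted(set(new.parameters) - set(old.parameters)),
        }

    # Hypothetical API change across two package versions:
    def resize_v1(img, size, interp="bilinear"): ...
    def resize_v2(img, size, mode="bilinear", antialias=True): ...

    print(signature_changed(resize_v1, resize_v2))
    # {'removed': ['interp'], 'added': ['antialias', 'mode']}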
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CODEMENV benchmark for code migration
Evaluates seven LLMs on 922 examples spanning 19 Python and Java packages
Tests function version compatibility and environment-aware adaptation (see the execution sketch after this list)
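Execution-based checking of adapted code is typically done by running it under the target environment's pinned interpreter. A minimal sketch, assuming one virtual environment per target configuration (the paper does not describe its harness at this level of detail):

    import subprocess

    def passes_in_env(python_bin, test_file, timeout=60):
        # python_bin: interpreter of a venv with the target package versions
        # pinned (hypothetical layout). Pass = test script exits with code 0.
        try:
            result = subprocess.run([python_bin, test_file],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    # e.g. passes_in_env(".envs/numpy124/bin/python", "task_042_test.py")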
Keyuan Cheng
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Xudong Shen
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Yihao Yang
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Tengyue Wang
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Yang Cao
Provable Responsible AI and Data Analytics (PRADA) Lab, South China University of Technology
Muhammad Asif Ali
King Abdullah University of Science and Technology
NLP · Deep Learning · Machine Learning
Hanbin Wang
Peking University
Natural Language Processing · Code Intelligence · Information Retrieval
Lijie Hu
Assistant Professor, MBZUAI
Explainable AI · LLM · Differential Privacy
Di Wang
Provable Responsible AI and Data Analytics (PRADA) Lab, King Abdullah University of Science and Technology