Cost-Efficient Long Code Translation using LLMs while Leveraging Identifier Replacements

📅 2025-10-10

📈 Citations: 0

✨ Influential: 0

career value

158K/year

🤖 AI Summary

To address the degradation in cross-lingual code translation accuracy caused by long source code exceeding large language models’ context windows, this paper proposes a zero-shot identifier compression translation method. Without fine-tuning or in-context examples, it applies semantics-preserving identifier replacement—substituting verbose identifiers with compact placeholders—thereby significantly reducing input token count while preserving syntactic structure and control flow. Experiments across mainstream language pairs (e.g., Java→Python) show an average 38.2% reduction in token consumption and a 12.7-point BLEU+ improvement in translation accuracy, while maintaining function signatures, nested structures, and type consistency. The core contribution is the first introduction of lightweight identifier normalization into zero-shot code translation, achieving a favorable trade-off among computational efficiency, interpretability, and cross-domain generalization.

Technology Category

Application Category

📝 Abstract

In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.

Problem

Research questions and friction points this paper is trying to address.

Addresses LLM limitations in translating long source codes exceeding context windows

Reduces token count and memory usage through identifier replacement technique

Improves efficiency and cost-effectiveness of code translation while preserving functionality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifier replacement reduces token count

Focuses on logical structure of code

Improves efficiency of long code translation

🔎 Similar Papers

Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation