Cost-Efficient Long Code Translation using LLMs while Leveraging Identifier Replacements

📅 2025-10-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the degradation in cross-lingual code translation accuracy caused by long source code exceeding large language models’ context windows, this paper proposes a zero-shot identifier compression translation method. Without fine-tuning or in-context examples, it applies semantics-preserving identifier replacement—substituting verbose identifiers with compact placeholders—thereby significantly reducing input token count while preserving syntactic structure and control flow. Experiments across mainstream language pairs (e.g., Java→Python) show an average 38.2% reduction in token consumption and a 12.7-point BLEU+ improvement in translation accuracy, while maintaining function signatures, nested structures, and type consistency. The core contribution is the first introduction of lightweight identifier normalization into zero-shot code translation, achieving a favorable trade-off among computational efficiency, interpretability, and cross-domain generalization.

Technology Category

Application Category

📝 Abstract
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
Problem

Research questions and friction points this paper is trying to address.

Addresses LLM limitations in translating long source codes exceeding context windows
Reduces token count and memory usage through identifier replacement technique
Improves efficiency and cost-effectiveness of code translation while preserving functionality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifier replacement reduces token count
Focuses on logical structure of code
Improves efficiency of long code translation
🔎 Similar Papers
2024-03-252024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge) Conference Acronym:Citations: 22