🤖 AI Summary
To address the degradation in cross-lingual code translation accuracy caused by long source code exceeding large language models’ context windows, this paper proposes a zero-shot identifier compression translation method. Without fine-tuning or in-context examples, it applies semantics-preserving identifier replacement—substituting verbose identifiers with compact placeholders—thereby significantly reducing input token count while preserving syntactic structure and control flow. Experiments across mainstream language pairs (e.g., Java→Python) show an average 38.2% reduction in token consumption and a 12.7-point BLEU+ improvement in translation accuracy, while maintaining function signatures, nested structures, and type consistency. The core contribution is the first introduction of lightweight identifier normalization into zero-shot code translation, achieving a favorable trade-off among computational efficiency, interpretability, and cross-domain generalization.
📝 Abstract
In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source codes that don't fit into the context window, which produces inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code, by reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.