🤖 AI Summary
To address the challenges in evaluating LLM-driven COBOL-to-Java code translation quality (model opacity, hard-to-quantify evaluation metrics, and the high cost of manual assessment), this paper proposes a multidimensional automated evaluation framework. The framework integrates static analysis checkers with LLM-as-a-Judge (LaaJ) techniques, embedded in a continuous integration pipeline with automated reporting to enable large-scale, reproducible benchmarking. Its key innovation is the combination of classical program analysis with large language model–based judgment, yielding an interpretable and scalable quality-scoring system. Experimental results show a substantial reduction in reliance on manual review alongside improved reliability and engineering efficiency for IBM watsonx Code Assistant for Z in legacy-system modernization.
📝 Abstract
We present an automated evaluation system for assessing COBOL-to-Java code translation within IBM's watsonx Code Assistant for Z (WCA4Z). The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment. Our approach combines analytic checkers with LLM-as-a-judge (LaaJ) techniques to deliver scalable, multi-faceted evaluations. The system supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review. We describe the system architecture, evaluation strategies, and reporting mechanisms that provide actionable insights for developers and project managers, facilitating the evolution of high-quality, modernized codebases.
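The combined scoring approach described above can be sketched in simplified form. This is a minimal illustration, not the actual WCA4Z implementation: the checker logic, the judge stub, the weights, and all function names (`compile_check`, `judge_stub`, `evaluate`) are assumptions made for illustration. In the real system the analytic checkers would run compilers and static analyzers, and the judge would prompt an LLM with a rubric and parse its verdict.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    checker_score: float  # 0-1, aggregated from analytic checkers
    judge_score: float    # 0-1, from the LLM-as-a-Judge rubric
    combined: float       # weighted overall quality score

def compile_check(java_source: str) -> float:
    """Toy stand-in for an analytic checker: awards partial credit for
    balanced braces and the presence of a class declaration."""
    score = 0.0
    if java_source.count("{") == java_source.count("}"):
        score += 0.5
    if "class " in java_source:
        score += 0.5
    return score

def judge_stub(cobol_source: str, java_source: str) -> float:
    """Placeholder for an LLM-as-a-Judge call; the real system would
    send both sources to a model and parse a numeric rating."""
    return 0.8  # fixed stand-in value for illustration

def evaluate(cobol_source: str, java_source: str,
             w_checker: float = 0.6, w_judge: float = 0.4) -> EvalResult:
    """Combine analytic and judge scores into one interpretable number,
    keeping the per-dimension scores for reporting."""
    c = compile_check(java_source)
    j = judge_stub(cobol_source, java_source)
    return EvalResult(c, j, w_checker * c + w_judge * j)
```

Keeping the per-dimension scores alongside the weighted total mirrors the paper's emphasis on interpretability: a report can show developers *why* a translation scored low (failed checkers vs. a poor judge rating), not just the aggregate.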