🤖 AI Summary
This study systematically evaluates ChatGPT (GPT-3.5 and GPT-4) on Japanese–English translation, benchmarking it against widely used commercial engines (e.g., Google Translate, DeepL) and contrasting document-level with sentence-level translation. It also examines the effect of simple versus context-enhanced prompting on output quality. Evaluation combines automatic metrics (BLEU, COMET) with human annotation under the MQM framework. Results show: (1) document-level translation substantially outperforms sentence-level translation, improving coherence and coreference consistency; (2) the experiments could not establish that enhanced prompts outperform simple prompts; (3) automatic metrics favor GPT-3.5, but human evaluation reveals a trade-off between accuracy (GPT-3.5) and fluency (GPT-4); and (4) ChatGPT is competitive with well-known commercial translation systems. The work provides empirical evidence, specifically for Japanese–English, of LLMs’ document-level translation advantage and highlights trade-offs between model versions in translation quality.
📝 Abstract
This study investigates ChatGPT for Japanese–English translation, exploring simple and enhanced prompts and comparing against commercially available translation engines. Using both automatic and MQM-based human evaluations, we found that document-level translation outperforms sentence-level translation for ChatGPT. On the other hand, our experiments could not determine whether enhanced prompts performed better than simple prompts. We also found that ChatGPT-3.5 was preferred by automatic evaluation, but that a trade-off exists between accuracy (ChatGPT-3.5) and fluency (ChatGPT-4). Lastly, ChatGPT yields competitive results against two widely known translation systems.
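The contrast between sentence-level and document-level translation comes down to how much source context the prompt exposes to the model. The sketch below illustrates the idea; the prompt wordings and function names are illustrative assumptions, not the paper's actual prompts.

```python
def sentence_level_prompt(src_sentence: str) -> str:
    """One prompt per sentence: the model sees no surrounding context,
    so pronouns like それ ('it') may lack an antecedent."""
    return (
        "Translate the following Japanese sentence into English:\n"
        f"{src_sentence}"
    )

def document_level_prompt(src_sentences: list[str]) -> str:
    """One prompt for the whole document: the model can resolve
    coreference and keep terminology consistent across sentences."""
    doc = "\n".join(src_sentences)
    return (
        "Translate the following Japanese document into English, "
        "keeping the translation coherent across sentences:\n"
        f"{doc}"
    )

# Hypothetical two-sentence document (not from the paper's test data):
document = ["猫が部屋に入った。", "それからソファで眠った。"]
print(sentence_level_prompt(document[1]))  # それ has no antecedent here
print(document_level_prompt(document))     # antecedent is available
```

In the sentence-level case the second sentence's subject is only recoverable from the first sentence, which the prompt omits; this is the kind of coreference gap that the document-level setting avoids.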