🤖 AI Summary
Large language models (LLMs) applied to machine translation (MT) still lack the structured reasoning that human translators employ: existing chain-of-thought (CoT) methods rely on rigid, task-specific templates and are poorly aligned with human strategies, while supervised fine-tuning (SFT) induces catastrophic forgetting. Method: We propose a reasoning-driven zero-shot cross-lingual transfer framework that introduces structured, multi-level human translation strategies—formalized as six expert-defined CoT templates—into general-purpose translation. Our approach integrates KL-constrained reinforcement learning with multi-stage reasoning distillation to jointly achieve human-aligned CoT modeling and autonomous CoT discovery. Results: Evaluated on Flores-101 across 21 languages and 80 translation directions, our method significantly improves translation quality, especially for the 15 languages unseen during training, and outperforms standard SFT in cross-lingual generalization. It also remains robust in specialized domains such as law and healthcare.
📝 Abstract
Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) such as DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), where human translators naturally employ structured, multi-layered chains of thought (CoTs), remains underexplored. Existing methods either design a fixed CoT tailored to a specific MT sub-task (e.g., literature translation), or rely on synthesizing CoTs unaligned with human strategies and on supervised fine-tuning (SFT) prone to catastrophic forgetting, limiting their adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework for inference-time reasoning in general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation beyond MT sub-tasks to six languages and diverse tasks (e.g., legal/medical domain adaptation, idiom resolution); (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies such as context-aware paraphrasing and back-translation; and (3) enabling self-evolving CoT discovery and anti-forgetting adaptation through RL with KL-constrained rewards. Experimental results show steady translation improvements across 21 languages and 80 translation directions on the Flores-101 test set, especially for the 15 languages unseen during training, while preserving general multilingual abilities compared with plain SFT.
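The KL-constrained reward mentioned above can be sketched as a sequence-level quality reward minus a penalty for drifting from a frozen reference model, which is the standard anti-forgetting mechanism in KL-regularized RL. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the Monte Carlo KL estimate, and the `beta` value are assumptions.

```python
def kl_constrained_reward(policy_logprobs, ref_logprobs, quality_reward, beta=0.1):
    """Combine a sequence-level translation-quality reward with a KL penalty.

    policy_logprobs / ref_logprobs: log-probabilities that the current policy
    and the frozen reference model assign to each generated token.
    beta: strength of the KL constraint keeping the policy near the reference,
    which discourages catastrophic forgetting of general abilities.
    """
    # Monte Carlo estimate of KL(policy || reference) over the sampled tokens:
    # E_policy[log pi(y_t) - log pi_ref(y_t)], averaged over the sequence.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) / len(policy_logprobs)
    return quality_reward - beta * kl_estimate
```

When the policy matches the reference, the penalty vanishes and the reward reduces to the quality score alone; as the policy's token log-probabilities drift above the reference's, the reward is discounted in proportion to `beta`.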