🤖 AI Summary
This study addresses the fundamental trade-off between task specialization (e.g., machine translation) and general-purpose capabilities (e.g., dialogue, reasoning, instruction following) in multilingual large language models. We propose a Pareto-optimal multi-stage training paradigm: continued pretraining → supervised fine-tuning → preference optimization → verifiable-reward reinforcement learning, integrated with multi-task data generation and rigorous filtering. We develop a family of multilingual models at three scales (2B, 9B, and 72B parameters) and introduce IF-MT, the first dedicated benchmark for instruction-following machine translation evaluation. Experimental results demonstrate that our 2B and 9B models outperform Llama 3.3-70B; the 72B model achieves state-of-the-art performance on high-resource language translation, Multilingual Arena Hard, and IF-MT, marking the first instance where translation specialization and broad general capabilities are simultaneously and synergistically enhanced.
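The multi-stage recipe above can be sketched as a simple ordered pipeline. This is an illustrative sketch only; the stage names and the `run_pipeline` helper are hypothetical and not the paper's actual API or implementation.

```python
# Hypothetical sketch of the multi-stage training recipe described above.
# Each stage is represented only by name; real stages would update weights.

STAGES = [
    "continued_pretraining",    # adapt the base LLM on multilingual data
    "supervised_fine_tuning",   # curated translation + general-purpose tasks
    "preference_optimization",  # align outputs using ranked preference data
    "rl_verifiable_rewards",    # reinforce answers that a verifier can check
]

def run_pipeline(model: dict, stages=STAGES) -> dict:
    """Apply each training stage in order, recording the stage history."""
    for stage in stages:
        model = {**model, "history": model.get("history", []) + [stage]}
    return model

checkpoint = run_pipeline({"name": "base-2b"})
```

The point of the sketch is only the strict ordering of stages: each later stage starts from the checkpoint produced by the previous one.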
📄 Abstract
Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this adaptation often comes at the cost of general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance on both translation and multilingual general-purpose text capabilities. We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities while optimizing for specific business domains, such as translation and localization.