Multilingual Contextualization of Large Language Models for Document-Level Machine Translation

📅 2025-04-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) struggle to model cross-sentence and cross-paragraph dependencies and to maintain discourse coherence in document-level machine translation. To address this, we propose DocBlocks, a curated, high-quality, fine-grained document-level parallel corpus, together with a multi-paradigm fine-tuning framework that unifies instruction-driven document-to-document translation with context-aware chunked translation, jointly optimizing long-range dependency modeling and sentence-level fidelity. Our approach overcomes key limitations of conventional prompt engineering and agent-based architectures in both coherence preservation and inference efficiency. Experiments show substantial gains on document-level translation: +2.1 BLEU and +4.3 COMET over strong baselines, 3.2× faster inference than agent-based methods, and competitive sentence-level translation performance.
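The paper does not spell out its training-data format, so as a concrete reading of the two paradigms, here is a minimal Python sketch of how document-to-document and context-aware chunked fine-tuning examples could be built from one parallel document. The prompt wording, field names, and sample sentences are hypothetical, not the actual DocBlocks schema.

```python
# Sketch of building mixed-paradigm fine-tuning examples from one parallel
# document. Prompt templates, field names, and data are illustrative only.

def doc_to_doc_example(src_doc, tgt_doc, src_lang, tgt_lang):
    """Instruction-driven document-to-document translation example."""
    return {
        "instruction": f"Translate the following document from {src_lang} to {tgt_lang}.",
        "input": src_doc,
        "output": tgt_doc,
    }

def chunked_examples(src_chunks, tgt_chunks, src_lang, tgt_lang, n_ctx=2):
    """Context-aware chunk-level examples: each chunk carries the preceding
    n_ctx source/target chunks as discourse context, plus (mirroring the
    abstract's 'with and without surrounding context') a context-free variant."""
    examples = []
    for i, (src, tgt) in enumerate(zip(src_chunks, tgt_chunks)):
        ctx_src = " ".join(src_chunks[max(0, i - n_ctx):i])
        ctx_tgt = " ".join(tgt_chunks[max(0, i - n_ctx):i])
        with_ctx = (
            f"Translate from {src_lang} to {tgt_lang}, consistent with the context.\n"
            f"Source context: {ctx_src}\nTarget context: {ctx_tgt}\nChunk: {src}"
        )
        no_ctx = f"Translate from {src_lang} to {tgt_lang}.\nChunk: {src}"
        examples.append({"instruction": with_ctx, "input": "", "output": tgt})
        examples.append({"instruction": no_ctx, "input": "", "output": tgt})
    return examples

# Toy parallel document.
src_chunks = ["Sie kam spät an.", "Ihr Zug hatte Verspätung."]
tgt_chunks = ["She arrived late.", "Her train was delayed."]

# Mixing both paradigms in one training set is what lets a single model learn
# long-range dependencies (doc-to-doc) while keeping sentence-level fidelity.
train_set = [doc_to_doc_example(" ".join(src_chunks), " ".join(tgt_chunks),
                                "German", "English")]
train_set += chunked_examples(src_chunks, tgt_chunks, "German", "English")
```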

📝 Abstract
Large language models (LLMs) have demonstrated strong performance in sentence-level machine translation, but scaling to document-level translation remains challenging, particularly in modeling long-range dependencies and discourse phenomena across sentences and paragraphs. In this work, we propose a method to improve LLM-based long-document translation through targeted fine-tuning on high-quality document-level data, which we curate and introduce as DocBlocks. Our approach supports multiple translation paradigms, including direct document-to-document and chunk-level translation, by integrating instructions both with and without surrounding context. This enables models to better capture cross-sentence dependencies while maintaining strong sentence-level translation performance. Experimental results show that incorporating multiple translation paradigms improves document-level translation quality and inference speed compared to prompting and agent-based methods.
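To make the chunk-level paradigm concrete at inference time, here is a minimal Python sketch assuming a Hugging Face causal LM fine-tuned along these lines; the checkpoint name and prompt template are placeholders, not artifacts released with the paper.

```python
# Sketch of context-aware chunked inference: translate a document chunk by
# chunk, feeding previously generated target text back in as context so the
# model can keep discourse phenomena (pronouns, terminology) consistent.
# The checkpoint name and prompt wording are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "your-org/docblocks-finetuned-llm"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def translate_document(src_chunks, src_lang="German", tgt_lang="English", n_ctx=2):
    tgt_chunks = []
    for chunk in src_chunks:
        # Rolling window of previously translated chunks as target-side context.
        ctx = " ".join(tgt_chunks[-n_ctx:])
        prompt = (
            f"Translate from {src_lang} to {tgt_lang}, consistent with the context.\n"
            f"Target context: {ctx}\nChunk: {chunk}\nTranslation:"
        )
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=256)
        # Strip the prompt tokens; keep only the newly generated translation.
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        tgt_chunks.append(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
    return " ".join(tgt_chunks)
```

A single forward pass per chunk with a rolling target context is what avoids the multi-pass overhead of agent-based pipelines, which is consistent with the inference-speed gains the paper reports.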
Problem

Research questions and friction points this paper is trying to address.

Improving document-level machine translation with LLMs
Modeling long-range dependencies in multilingual documents
Integrating multiple translation paradigms for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning LLMs with curated DocBlocks data
Supporting multiple document translation paradigms
Improving cross-sentence dependencies and translation speed
Miguel Moura Ramos
Instituto Superior Técnico, Universidade de Lisboa (ELLIS Unit Lisbon), Instituto de Telecomunicações
Patrick Fernandes
Carnegie Mellon University & Instituto Superior Técnico
NLP · Machine Learning
Sweta Agrawal
Research Scientist at Google
Machine Translation · Natural Language Generation and Evaluation
André F. T. Martins
Instituto Superior Técnico, Universidade de Lisboa (ELLIS Unit Lisbon), Instituto de Telecomunicações, Unbabel