LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline

📅 2025-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high latency and degraded translation quality that autoregressive decoding imposes on large language models (LLMs) in simultaneous machine translation (SiMT), this paper proposes an interleaved supervised fine-tuning (SFT) framework that requires no architectural changes. It models source and target tokens as a single interleaved sequence, uses special tokens to control latency dynamically, and adapts autoregressive decoding through latency prompts, so one model handles both SiMT and offline translation. The approach requires only a small amount of SFT data and achieves state-of-the-art performance across multiple SiMT benchmarks, significantly outperforming dedicated SiMT models. It also preserves the original offline translation capability and generalizes to document-level SiMT without task-specific fine-tuning. This suggests that high-quality, low-latency simultaneous translation can be attained without architectural modifications or task-specific model variants, offering a scalable and unified paradigm for LLM-based MT.

📝 Abstract

When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation, even with a simple prompt such as "Translate the following sentence from [src lang] into [tgt lang]:". However, in many real scenarios the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required. In this setting, the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks and preserves the original abilities of offline translation. Moreover, our approach generalizes well to the document-level SiMT setting without requiring specific fine-tuning, even beyond the offline translation model.
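The interleaving idea can be illustrated with a minimal sketch. The abstract does not specify the special tokens or how source and target are segmented, so everything here is an assumption: the token names `<|end-of-read|>` and `<|end-of-write|>` are hypothetical, and a simple wait-k-style schedule (read `k` source tokens, then alternate one write and one read) stands in for whatever alignment the paper actually uses to build its SFT data.

```python
from typing import List

# Hypothetical separator tokens; the paper's actual special tokens are not given here.
READ_END = "<|end-of-read|>"
WRITE_END = "<|end-of-write|>"


def interleave(src: List[str], tgt: List[str], k: int) -> List[str]:
    """Build one interleaved training sequence from a source/target pair.

    Assumed schedule (wait-k style): read k source tokens first, then
    alternate "write one target token" and "read one source token".
    Once the source is exhausted, the remaining target is written in one go.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    seq: List[str] = []
    s = t = 0          # positions in src and tgt
    read_size = k      # the first read takes k tokens, later reads take 1

    while s < len(src) or t < len(tgt):
        if s < len(src):
            chunk = src[s:s + read_size]
            seq.extend(chunk)
            seq.append(READ_END)
            s += len(chunk)
            read_size = 1
        if t < len(tgt):
            # Write a single token while source remains, otherwise flush the rest.
            n = 1 if s < len(src) else len(tgt) - t
            seq.extend(tgt[t:t + n])
            seq.append(WRITE_END)
            t += n
    return seq


low_latency = interleave(["a", "b", "c"], ["x", "y", "z"], k=2)
offline_like = interleave(["a", "b", "c"], ["x", "y", "z"], k=5)
```

Under this assumed schedule, a large `k` reads the whole source before any target token is written, so the sequence reduces to the offline layout. That is one way a single interleaved format could cover both SiMT and offline translation.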

Problem

Research questions and friction points this paper is trying to address.

Enabling LLMs to perform high-quality simultaneous machine translation efficiently

Overcoming auto-regressive limitations in streaming source token scenarios

Adapting LLMs for low-latency translation without sacrificing offline performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs interleaved source-target sequences for SiMT

Uses special tokens for adaptive read-write operations

Maintains efficient auto-regressive decoding in LLMs
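A rough sketch of how read and write operations might alternate at inference time follows. It is not the paper's implementation: the separator tokens are hypothetical, `next_token` is a stand-in callable for an LLM decoding step, and `toy_next_token` is a dummy "model" that just uppercases the source so the loop can be run without any weights.

```python
from typing import Callable, Iterable, List

# Hypothetical separator tokens, as in any illustration of this scheme.
READ_END = "<|end-of-read|>"
WRITE_END = "<|end-of-write|>"


def simultaneous_decode(
    source_chunks: Iterable[List[str]],
    next_token: Callable[[List[str]], str],
    max_write: int = 64,
) -> List[str]:
    """Alternate between reading source chunks and writing target tokens.

    The context is only ever appended to, which is what lets a real
    decoder-only model keep reusing its cached prefix between steps.
    """
    context: List[str] = []
    output: List[str] = []
    for chunk in source_chunks:
        # READ: append the newly arrived source tokens and close the read span.
        context.extend(chunk)
        context.append(READ_END)
        # WRITE: generate until the model itself decides to stop writing.
        for _ in range(max_write):
            tok = next_token(context)
            context.append(tok)
            if tok == WRITE_END:
                break
            output.append(tok)
    return output


def toy_next_token(context: List[str]) -> str:
    """Dummy model: 'translates' each source token by uppercasing it."""
    src: List[str] = []
    tgt: List[str] = []
    reading = True
    for tok in context:
        if tok == READ_END:
            reading = False
        elif tok == WRITE_END:
            reading = True
        elif reading:
            src.append(tok)
        else:
            tgt.append(tok)
    # Write until the target has caught up with the source read so far.
    return src[len(tgt)].upper() if len(tgt) < len(src) else WRITE_END


translation = simultaneous_decode([["a", "b"], ["c"]], toy_next_token)
```

The design point the sketch tries to capture is that the model, not an external policy, ends each write span by emitting the separator, so latency control stays inside ordinary next-token prediction.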

🔎 Similar Papers

Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models