AI Summary
Simultaneous machine translation (SimulMT) faces a fundamental trade-off among translation quality, latency, and the computational cost of large language model (LLM) inference. To address this, we propose the first multi-turn conversational decoding framework tailored for SimulMT, integrating Llama2-7b-chat into streaming translation. Our approach introduces a dynamic waiting policy and a lightweight context compression mechanism to substantially reduce autoregressive decoding overhead. Crucially, it shifts from conventional single-pass generation to iterative, interactive decoding, enabling fine-grained incremental output while preserving semantic coherence. Evaluated on two standard SimulMT benchmarks, our method surpasses dedicated SimulMT models in BLEU score, achieves average latency comparable to those models, and reduces latency by over 42% compared to standard LLM-based streaming translation. To the best of our knowledge, this is the first work to jointly achieve high translation quality, low latency, and computational efficiency in LLM-driven SimulMT.
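The decoding loop described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the names `translate_chunk`, `WAIT_K`, and `MAX_HISTORY` are hypothetical, the LLM call is replaced by a toy stand-in, and the waiting policy is simplified to a fixed token threshold rather than the paper's dynamic policy.

```python
WAIT_K = 3        # waiting policy (simplified here to a fixed threshold)
MAX_HISTORY = 4   # context compression: keep only the most recent dialogue turns

def translate_chunk(history, chunk):
    """Stand-in for an LLM call (e.g. Llama2-7b-chat) that translates the
    newly arrived source chunk, conditioned on the dialogue history."""
    return [f"tgt({tok})" for tok in chunk]  # toy word-for-word "translation"

def conversational_simulmt(source_stream):
    """Multi-turn conversational decoding: each read/write step is one
    dialogue turn (user: new source chunk, assistant: partial translation)."""
    history, buffer, output = [], [], []
    for token in source_stream:
        buffer.append(token)
        # Waiting policy: only decode once enough new source tokens arrived.
        if len(buffer) < WAIT_K:
            continue
        translation = translate_chunk(history, buffer)
        history.append((tuple(buffer), tuple(translation)))
        # Context compression: drop the oldest turns beyond a fixed window,
        # bounding the autoregressive decoding overhead per step.
        history = history[-MAX_HISTORY:]
        output.extend(translation)
        buffer = []
    if buffer:  # flush any remaining source tokens at end of stream
        output.extend(translate_chunk(history, buffer))
    return output
```

The key contrast with single-pass generation is that each turn emits an incremental partial translation while the trimmed dialogue history keeps the prompt short, which is where the latency savings come from.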
Abstract
Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework that enhances the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of the LLM in translation quality while achieving computational latency comparable to specialized SimulMT models.