AI Summary
Simultaneous machine translation (SimulMT) faces a fundamental trade-off among translation quality, latency, and the computational cost of large language model (LLM) inference. To address this, we propose the first multi-turn conversational decoding framework tailored for SimulMT, integrating Llama2-7b-chat into streaming translation. Our approach introduces a dynamic waiting policy and a lightweight context compression mechanism to substantially reduce autoregressive decoding overhead. Crucially, it shifts from conventional single-pass generation to iterative, interactive decoding, enabling fine-grained incremental output while preserving semantic coherence. Evaluated on two standard SimulMT benchmarks, our method surpasses dedicated SimulMT models in BLEU score, achieves average latency comparable to those models, and reduces latency by over 42% compared to standard LLM-based streaming translation. To the best of our knowledge, this is the first work to jointly achieve high translation quality, low latency, and computational efficiency in LLM-driven SimulMT.
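The decoding loop described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the names `translate_chunk`, `WAIT_K`, and `MAX_HISTORY` are hypothetical, the LLM call is replaced by a toy stand-in, and the waiting policy is simplified to a fixed token threshold rather than the paper's dynamic policy.

```python
WAIT_K = 3        # waiting policy (simplified here to a fixed threshold)
MAX_HISTORY = 4   # context compression: keep only the most recent dialogue turns

def translate_chunk(history, chunk):
    """Stand-in for an LLM call (e.g. Llama2-7b-chat) that translates the
    newly arrived source chunk, conditioned on the dialogue history."""
    return [f"tgt({tok})" for tok in chunk]  # toy word-for-word "translation"

def conversational_simulmt(source_stream):
    """Multi-turn conversational decoding: each read/write step is one
    dialogue turn (user: new source chunk, assistant: partial translation)."""
    history, buffer, output = [], [], []
    for token in source_stream:
        buffer.append(token)
        # Waiting policy: only decode once enough new source tokens arrived.
        if len(buffer) < WAIT_K:
            continue
        translation = translate_chunk(history, buffer)
        history.append((tuple(buffer), tuple(translation)))
        # Context compression: drop the oldest turns beyond a fixed window,
        # bounding the autoregressive decoding overhead per step.
        history = history[-MAX_HISTORY:]
        output.extend(translation)
        buffer = []
    if buffer:  # flush any remaining source tokens at end of stream
        output.extend(translate_chunk(history, buffer))
    return output
```

The key contrast with single-pass generation is that each turn emits an incremental partial translation while the trimmed dialogue history keeps the prompt short, which is where the latency savings come from.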
Abstract
Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework that enhances the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of the LLM in translation quality while achieving computational latency comparable to specialized SimulMT models.