🤖 AI Summary
Simultaneous speech translation inherently faces a trade-off between translation quality and latency. This paper proposes an information-gain-based adaptive waiting strategy: the model delays output only when waiting for more audio is expected to reduce its uncertainty about the translation. To this end, the authors design the REINA loss, derived from information theory, which trains an adaptive policy on top of an existing non-streaming translation model with entropy regularization. They also introduce a new metric for streaming efficiency. The model is trained only on open-source and synthetically generated data and covers French, Spanish, and German, each translated both from and into English. It achieves state-of-the-art streaming results among models of comparable size, improving the latency/quality trade-off by up to 21% under the proposed metric and pushing the reported Pareto frontier between quality and latency beyond prior work.
📝 Abstract
Simultaneous Speech Translation (SimulST) systems ingest streaming audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this trade-off: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss for training an adaptive policy using an existing non-streaming translation model. We derive REINA from information-theoretic principles and show that it pushes the reported Pareto frontier of the latency/quality trade-off beyond prior work. Using REINA, we train a SimulST model on French, Spanish, and German, both from and into English. Training only on open-source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, showing quantitatively that REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.
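The "wait only if you gain information" strategy can be illustrated with a toy calculation. The sketch below is not the REINA loss or the authors' policy; it only assumes that information gain is measured as the drop in next-token entropy between a partial-audio prediction and a full-audio prediction. The function names, the threshold of 0.5 nats, and the probability values are all hypothetical. In a real streaming setting the full audio is not available at decision time, so a policy would have to estimate this gain rather than compute it directly.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def information_gain(partial_probs, full_probs):
    """Entropy drop for the next token when moving from partial to full audio.

    Assumption for illustration: both distributions come from the same
    translation model, conditioned on different amounts of audio.
    """
    return entropy(partial_probs) - entropy(full_probs)


def should_wait(estimated_gain, threshold=0.5):
    """Wait for more audio only if the estimated gain exceeds the threshold.

    The threshold is a hypothetical knob that trades latency for quality.
    """
    return estimated_gain > threshold


# Toy next-token distributions over a 4-word vocabulary (made-up numbers).
ambiguous = [0.25, 0.25, 0.25, 0.25]  # partial audio: the model is unsure
confident = [0.97, 0.01, 0.01, 0.01]  # full audio: the model is sure

gain_high = information_gain(ambiguous, confident)  # more audio helps a lot
gain_low = information_gain(confident, confident)   # more audio adds nothing

print("wait" if should_wait(gain_high) else "emit")
print("wait" if should_wait(gain_low) else "emit")
```

Raising the threshold in this sketch makes the toy policy emit earlier at the risk of lower quality; lowering it does the opposite, which is the trade-off the abstract describes.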