🤖 AI Summary
This work addresses fragmented, frequently paused speech output in simultaneous speech-to-speech translation, a problem that arises from stringent low-latency constraints, compromises fluency, and can increase listeners’ cognitive load. The authors propose a fluency-aware optimization framework that, for the first time, leverages model-internal signals, such as linguistic diversity and induced variability in speech durations, to adjust the pacing of the generated speech. By adaptively modulating this pacing, the method reduces inter-chunk silences and improves overall coherence. Evaluated on both short- and long-form benchmarks, the approach yields more natural-sounding speech streams while maintaining competitive translation quality and latency.
📝 Abstract
Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling alternative to high-latency consecutive translation. However, the excessive pursuit of low latency often results in fragmented, chunk-wise speech. Listeners are consequently subjected to an unnatural acoustic flow punctuated by frequent pauses, which can increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to balance the low latency of simultaneous translation against the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.