🤖 AI Summary
This work addresses read/write decision-making in simultaneous speech translation for long-form audio without additional training. It proposes Decoder-Only Attention (DOA), a fine-tuning-free decoding policy that uses the self-attention of off-the-shelf Speech Large Language Models (SpeechLLMs), such as Phi4-Multimodal and Qwen3-Omni, to derive a proxy alignment between the generated text and the incoming speech. The experiments indicate that decoder self-attention alone carries an alignment signal stable enough to drive streaming decisions. The method achieves low latency with translation quality close to offline decoding, making training-free simultaneous translation practical for long-form speech.
📝 Abstract
Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.
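The abstract does not spell out how DOA turns self-attention into read/write decisions, so the following is only a minimal sketch of one plausible rule, not the authors' method. It assumes each generated token comes with a single attention row over the decoder prefix (for example, already averaged over heads and layers), that the audio occupies a known contiguous span of positions in that prefix, and that a token is unsafe to emit when its most-attended audio position falls within the last `guard_frames` positions received so far. The names `proxy_alignment`, `tokens_to_write`, and `guard_frames` are hypothetical.

```python
from typing import List, Sequence


def proxy_alignment(
    attn_rows: Sequence[Sequence[float]], audio_start: int, audio_end: int
) -> List[int]:
    """Map each generated token to its most-attended audio position.

    attn_rows[i] holds the attention weights of generated token i over the
    decoder prefix. Positions audio_start..audio_end-1 are assumed to be the
    audio positions. Returned indices are relative to audio_start.
    """
    aligned = []
    for row in attn_rows:
        audio_slice = row[audio_start:audio_end]
        # Argmax over the audio span only; ties resolve to the earliest position.
        best = max(range(len(audio_slice)), key=lambda i: audio_slice[i])
        aligned.append(best)
    return aligned


def tokens_to_write(
    aligned_frames: Sequence[int], num_audio_frames: int, guard_frames: int
) -> int:
    """Count the leading tokens considered safe to emit (WRITE).

    Emission stops at the first token aligned to one of the last
    guard_frames audio positions; the policy then READs more audio.
    """
    limit = num_audio_frames - guard_frames
    count = 0
    for frame in aligned_frames:
        if frame >= limit:
            break
        count += 1
    return count


# Toy example: 2 prompt positions, then 6 audio positions, then generated tokens.
attn = [
    [0.10, 0.10, 0.50, 0.10, 0.10, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.10, 0.10, 0.50, 0.10, 0.05, 0.05, 0.00],
    [0.05, 0.05, 0.02, 0.03, 0.05, 0.10, 0.10, 0.50, 0.05, 0.05],
]
aligned = proxy_alignment(attn, audio_start=2, audio_end=8)
n_write = tokens_to_write(aligned, num_audio_frames=6, guard_frames=2)
print(aligned, n_write)
```

In this toy case the third token attends mostly to the final audio position, so only the first two tokens would be written before reading more speech. A larger `guard_frames` trades higher latency for more stable output; how DOA actually selects layers, heads, or thresholds is not stated in the abstract.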