MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation task

📅 2025-06-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address high latency and poor coherence in long-form streaming speech translation, this work proposes a modular cascaded architecture: Whisper-Large-V3-Turbo serves as the ASR backbone, coupled with the multilingual NLLB-3.3B translation model. The authors introduce document-level prefix training to improve modeling of incomplete inputs, design an adaptive wait-k emission strategy, apply RALCP (a relaxed-agreement longest-common-prefix emission policy), and incorporate dynamic buffer management to jointly optimize responsiveness and output consistency. Evaluated on the ACL60/60 test set, the system achieves 31.96 BLEU at an average latency of 2.94 seconds (non-computation-aware StreamLAAL); on the official test set it attains a preliminary 29.8 BLEU, demonstrating a favorable trade-off between translation quality and low latency.
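
The wait-k emission strategy mentioned above can be sketched as follows. This is a minimal illustrative simulation, not the authors' implementation: `translate_prefix` is a hypothetical callable standing in for the prefix-trained MT model, and the READ/WRITE schedule shown is the basic fixed wait-k policy (read k source tokens, then alternate one read with one write).

```python
def wait_k_translate(source_tokens, translate_prefix, k=3):
    """Simulate a basic wait-k policy: wait until k source tokens have
    been read, then emit one new target token per additional read.

    `translate_prefix` is an assumed callable mapping a source prefix
    (list of tokens) to a full target hypothesis (list of tokens).
    """
    target = []
    for t in range(len(source_tokens)):
        if t + 1 < k:
            continue  # still in the initial waiting phase
        hyp = translate_prefix(source_tokens[: t + 1])
        if len(hyp) > len(target):
            target.append(hyp[len(target)])  # WRITE one new token
    # Source exhausted: flush the remainder of the final hypothesis.
    final = translate_prefix(source_tokens)
    target.extend(final[len(target):])
    return target
```

In a real streaming system the emitted tokens are committed (never retracted), which is why the policy only appends to `target` and never rewrites it.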

📝 Abstract
This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the unique challenges of real-time translation of long-form speech by developing a modular cascade system that adapts strong pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo for ASR with the multilingual NLLB-3.3B model for MT, implementing lightweight adaptation techniques rather than training new end-to-end models from scratch. Our approach employs document-level adaptation with prefix training to enhance the MT model's ability to handle incomplete inputs, while incorporating adaptive emission policies including a wait-$k$ strategy and RALCP for managing the translation stream. Specialized buffer management techniques and segmentation strategies ensure coherent translations across long audio sequences. Experimental results on the ACL60/60 dataset demonstrate that our system achieves a favorable balance between translation quality and latency, with a BLEU score of 31.96 and non-computational-aware StreamLAAL latency of 2.94 seconds. Our final model achieves a preliminary score on the official test set (IWSLT25Instruct) of 29.8 BLEU. Our work demonstrates that carefully adapted pre-trained components can create effective simultaneous translation systems for long-form content without requiring extensive in-domain parallel data or specialized end-to-end training.
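
The RALCP policy referenced in the abstract relaxes the classic longest-common-prefix agreement criterion: instead of emitting only the prefix on which all candidate hypotheses agree, it emits the longest prefix on which a sufficient fraction of them vote for the same token. The sketch below is an illustrative reconstruction under that assumption; the threshold name `gamma` and the data layout are not taken from the paper.

```python
from collections import Counter


def ralcp(hypotheses, gamma=0.6):
    """Relaxed-agreement longest common prefix: emit the longest prefix
    on which at least a fraction `gamma` of candidate hypotheses
    (e.g. beam entries) agree position by position.
    """
    emitted = []
    n = len(hypotheses)
    for pos in range(min(len(h) for h in hypotheses)):
        counts = Counter(h[pos] for h in hypotheses)
        token, votes = counts.most_common(1)[0]
        if votes / n >= gamma:
            emitted.append(token)  # enough agreement: commit this token
        else:
            break  # agreement broke down: stop emitting here
    return emitted
```

With `gamma = 1.0` this degenerates to the strict longest-common-prefix policy; lowering `gamma` trades some stability for lower latency, which matches the quality/latency balance the system targets.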
Problem

Research questions and friction points this paper is trying to address.

Challenges of real-time translation of long-form speech
Adapting pre-trained models for streaming scenarios
Balancing translation quality and latency effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular cascade system with pre-trained models
Lightweight adaptation techniques for streaming
Document-level prefix training for MT
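
The prefix-training idea in the last bullet is commonly implemented by augmenting each parallel example with proportionally truncated source/target prefixes, so the MT model learns to produce sensible partial translations of incomplete inputs. The helper below is a minimal sketch of one such augmentation recipe; the sampling scheme (evenly spaced proportional cuts) is an assumption, not the paper's exact procedure.

```python
def make_prefix_pairs(src_tokens, tgt_tokens, n_prefixes=3):
    """Augment one parallel sentence pair with `n_prefixes` proportional
    prefix pairs for prefix training (illustrative recipe).
    """
    pairs = [(src_tokens, tgt_tokens)]  # keep the full pair
    for i in range(1, n_prefixes + 1):
        frac = i / (n_prefixes + 1)
        s = src_tokens[: max(1, int(len(src_tokens) * frac))]
        t = tgt_tokens[: max(1, int(len(tgt_tokens) * frac))]
        pairs.append((s, t))
    return pairs
```

Truncating source and target at matching proportions is a crude alignment heuristic; it works as a training signal because the model only needs to learn that short inputs warrant short outputs.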