MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction

πŸ“… 2025-10-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
In speech-to-speech translation (S2ST), speech tokens exhibit sparse semantics: an individual token rarely represents a complete semantic unit. To address this, we propose hidden-layer multi-token prediction (MTP): within the speech-to-unit translation (S2UT) framework, we apply the multi-token prediction loss not only at the output layer but also to intermediate hidden layers, encouraging the model to learn denser, more holistic semantic representations earlier in the encoding process. This design is jointly optimized with the Connectionist Temporal Classification (CTC) loss to achieve hierarchical semantic enhancement. Experiments across multiple benchmarks demonstrate that all MTP variants consistently improve translation quality; notably, MTP-S2UT achieves state-of-the-art performance, validating that strengthening semantic representations in hidden layers significantly enhances the model's capacity for speech translation.

πŸ“ Abstract
Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is not semantically dense, so multiple tokens are generally needed to express a complete semantic unit. To address this limitation, we introduce a multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling models to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and enhancing the information density per position. Initial MTP implementations apply the loss at the final layer, which improves the output representation but initiates information enrichment too late. We hypothesize that advancing the information enrichment process to intermediate layers achieves earlier and more effective enhancement of hidden representations. Consequently, we propose the MTP-S2UT loss, applying the MTP loss to the hidden representations where the CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.
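The core idea in the abstract can be illustrated with a minimal sketch: at each position, the model predicts not just the next token but the next few, and a cross-entropy loss is averaged over all such future-token predictions. The function below is an illustrative NumPy reconstruction of a generic multi-token prediction loss, not the paper's actual implementation; the shapes, `n_future` parameter, and prediction-head layout are assumptions.

```python
import numpy as np

def mtp_loss(logits, targets, n_future=2):
    """Sketch of a multi-token prediction loss.

    At each position t, the model emits `n_future` distributions over the
    unit vocabulary, one per future offset, predicting tokens t+1..t+n_future.

    logits  : (T, n_future, V) array of unnormalized scores
    targets : (T,) array of gold unit indices
    Positions whose future target falls past the sequence end are skipped.
    """
    T, K, V = logits.shape
    total, count = 0.0, 0
    for t in range(T):
        for k in range(K):
            j = t + 1 + k  # index of the (k+1)-th future token
            if j >= T:
                continue
            # numerically stable log-softmax, then NLL of the gold token
            z = logits[t, k] - logits[t, k].max()
            log_probs = z - np.log(np.exp(z).sum())
            total += -log_probs[targets[j]]
            count += 1
    return total / count

# Toy example: 4 positions, vocabulary of 5 units, 2 future tokens per position.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2, 5))
targets = np.array([1, 3, 0, 2])
loss = mtp_loss(logits, targets)
```

In the paper's setting this loss would be applied both at the output layer and at the intermediate layer where CTC is computed; here only the per-position loss itself is sketched.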
Problem

Research questions and friction points this paper is trying to address.

Enhancing semantic density in speech-to-speech translation
Addressing late-stage information enrichment in translation models
Improving hidden representation quality through multi-token prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-token prediction enhances speech-to-unit translation
MTP loss applied to hidden representations with CTC
Earlier semantic enrichment improves translation quality
Jianjin Wang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Runsong Zhao
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Xiaoqian Liu
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yuan Ge
Northeastern University, China
Reasoning · Multimodality · LLMs
Ziqiang Xu
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Tong Xiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China
Shengxiang Gao
Kunming University of Science and Technology, Kunming, China
Zhengtao Yu
Kunming University of Science and Technology
Jingbo Zhu
Northeastern University, China
Machine Translation · Language Parsing · Natural Language Processing