Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes DS2ST-LM, a novel end-to-end speech-to-speech translation (S2ST) framework that addresses the challenges of unstable semantic-acoustic alignment, poor speaker identity preservation, and limited multilingual scalability in direct S2ST systems. For the first time, it integrates a voice timbre-aware mechanism into a large language model (LLM)-based architecture, combining a Whisper encoder, learnable projection modules (Linear/Conv1D-Linear/Q-Former), the Qwen2-0.5B LLM, and a timbre-controllable vocoder. High-fidelity synthetic target speech is leveraged to mitigate the scarcity of parallel corpora. Experimental results demonstrate that DS2ST-LM outperforms both cascaded and ST+TTS baselines across multiple metrics, including BLEU, METEOR, BLEURT, and COMET, while significantly improving speech naturalness and speaker similarity. The framework also extends to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu.
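As a rough illustration of the pipeline the summary describes, the sketch below wires a Whisper encoder, a learnable projector, and Qwen2-0.5B together in PyTorch with Hugging Face `transformers`. The checkpoint names, the semantic-token vocabulary size, and the vocoder interface are assumptions for illustration, not the paper's released code.

```python
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM

class DS2STSketch(nn.Module):
    """Hypothetical sketch of the DS2ST-LM architecture described above."""
    def __init__(self, num_semantic_tokens=4096):  # vocab size assumed
        super().__init__()
        # Whisper encoder extracts frame-level speech representations.
        self.encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        self.llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
        # Learnable projection into the LLM embedding space (the simple
        # Linear variant, which the abstract reports performing best).
        self.projector = nn.Linear(self.encoder.config.d_model,
                                   self.llm.config.hidden_size)
        # Head over discrete semantic tokens; a timbre-controllable
        # vocoder (not shown) would render them into a waveform.
        self.token_head = nn.Linear(self.llm.config.hidden_size,
                                    num_semantic_tokens)

    def forward(self, input_features):  # (B, 80, 3000) log-mel features
        speech = self.encoder(input_features).last_hidden_state
        prefix = self.projector(speech)            # (B, T, hidden)
        out = self.llm(inputs_embeds=prefix, output_hidden_states=True)
        return self.token_head(out.hidden_states[-1])
```

In the full system, the predicted semantic tokens (the abstract mentions speech-derived S3 tokens as one option) would drive the timbre-controllable vocoder, presumably conditioned on a speaker embedding extracted from the source utterance.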

📝 Abstract
Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another while reducing the error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to face notable challenges, including instability in semantic-acoustic alignment when parallel speech data is scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework leveraging a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus, by extending the GigaST dataset with high-fidelity synthetic target speech, and show that this synthetic data alleviates data scarcity to some extent. We investigate two semantic token generation strategies, speech-derived S3 tokens and text-derived tokens generated by a pre-trained LLM, and analyze their impact on training stability and semantic consistency. We further evaluate three projection architectures (Linear, Conv1D-Linear, and Q-Former) and observe that while higher-capacity projectors converge faster, the simple Linear projector achieves the best final performance. Extensive experiments demonstrate that DS2ST-LM outperforms traditional cascaded and ST (Qwen-Audio) + TTS baselines on both lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics, while extending to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu. Furthermore, we incorporate timbre-aware speech synthesis to preserve speaker information, enabling DS2ST-LM to surpass prior direct S2ST systems in both speaker similarity and perceptual naturalness.
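The three projection architectures the abstract compares could be sketched as follows; the encoder width (Whisper-small's 768), the LLM width (Qwen2-0.5B's 896), and the Q-Former query count are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Plain linear map; per the abstract, the best final performer."""
    def __init__(self, enc_dim=768, llm_dim=896):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x):                      # x: (B, T, enc_dim)
        return self.proj(x)

class Conv1DLinearProjector(nn.Module):
    """Conv1D downsamples the time axis before the linear map."""
    def __init__(self, enc_dim=768, llm_dim=896, stride=2):
        super().__init__()
        self.conv = nn.Conv1d(enc_dim, enc_dim, kernel_size=3,
                              stride=stride, padding=1)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x):                      # x: (B, T, enc_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)

class QFormerProjector(nn.Module):
    """Learnable queries cross-attend to the encoder output, yielding a
    fixed-length prefix regardless of utterance duration."""
    def __init__(self, enc_dim=768, llm_dim=896, num_queries=64, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, enc_dim))
        self.attn = nn.MultiheadAttention(enc_dim, heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x):                      # x: (B, T, enc_dim)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        attended, _ = self.attn(q, x, x)       # (B, num_queries, enc_dim)
        return self.proj(attended)
```

The fixed query budget is one plausible reason a Q-Former converges quickly, though the abstract reports the time-resolution-preserving Linear projector winning on final quality.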
Problem

Research questions and friction points this paper is trying to address.

Direct Speech-to-Speech Translation
Speaker Identity Preservation
Multilingual Scalability
Semantic-Acoustic Alignment
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct Speech-to-Speech Translation
Timbre-Aware Synthesis
Multilingual LLM
Semantic Token Strategy
Speaker Identity Preservation
👥 Authors
Lalaram Arya
Indian Institute of Technology Dharwad, Dharwad, India
Mrinmoy Bhattacharjee
Assistant Professor, IIT Jammu
Automatic Speech Recognition · Speaker Verification · Speech/Audio Processing
R. Adarsh C.
Indian Institute of Technology Dharwad, Dharwad, India
S. R. M. Prasanna
Indian Institute of Technology Dharwad, Dharwad, India