KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

📅 2026-06-05

🤖 AI Summary

This study addresses key challenges in cross-lingual voice cloning: preserving speaker identity, limiting accent leakage from the source language, and pronouncing domain-specific terminology correctly. Building on the multilingual TTS model FishAudio-S2-Pro, the work makes three contributions: language tag prompting, which strengthens language control and reduces accent leakage; a reference-conditioned lexical matching method, which improves the pronunciation of domain-specific terms when the reference and the target text share vocabulary; and reinforcement learning (RL) fine-tuning, which the authors describe as, to the best of their knowledge, the first application of RL fine-tuning to this task and which improves the intelligibility of the synthesized speech. Experimental results show that language prompting yields the largest gains, that lexical matching gives consistent improvements on the subsets with overlapping vocabulary, and that the combined approach improves the overall naturalness and clarity of the generated speech.

📝 Abstract

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
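The notion of "lexical overlap" between a reference and the target text can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' method: it assumes a transcript of the reference audio is available, treats overlap as shared word forms after lowercasing, and uses a hypothetical minimum word length to skip short function words. The function names `tokenize`, `overlapping_terms`, and `is_matched` are invented for this example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic word tokens.

    This is a deliberate simplification; a real system would likely use a
    language-aware tokenizer.
    """
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def overlapping_terms(reference_transcript: str, target_text: str,
                      min_len: int = 4) -> set[str]:
    """Return words that appear in both the reference transcript and the
    target text. `min_len` is a hypothetical filter for short words."""
    shared = tokenize(reference_transcript) & tokenize(target_text)
    return {word for word in shared if len(word) >= min_len}


def is_matched(reference_transcript: str, target_text: str) -> bool:
    """Assumed criterion: a sample is 'matched' if at least one term overlaps."""
    return bool(overlapping_terms(reference_transcript, target_text))


if __name__ == "__main__":
    ref = "The transformer model uses attention"
    tgt = "Das Transformer Modell nutzt Attention"
    print(sorted(overlapping_terms(ref, tgt)))
    print(is_matched(ref, "Guten Morgen"))
```

Under this assumed criterion, a test set could be split into matched and unmatched subsets, which is one plausible reading of the "matched subsets" on which lexical matching is reported to help.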

Problem

Research questions and friction points this paper is trying to address.

cross-lingual voice cloning

speaker identity preservation

accent variation

domain-specific vocabulary

speech intelligibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

language tag prompting

reinforcement learning fine-tuning

reference-conditioned lexical matching

cross-lingual voice cloning

accent leakage reduction
