AI Summary
This study addresses key challenges in cross-lingual voice cloning: preserving speaker identity, controlling accent transfer, and pronouncing domain-specific terminology accurately. Building on the multilingual TTS model FishAudio-S2-Pro, the work proposes three core innovations: language tag prompting, which strengthens language control and suppresses accent leakage; a reference-audio-guided lexical matching strategy that substantially improves pronunciation accuracy for domain terms; and, to the best of our knowledge, the first application of reinforcement learning fine-tuning to this task, which optimizes the intelligibility of the synthesized speech. Experimental results show that language prompting yields the largest performance gain, lexical matching consistently improves pronunciation on overlapping vocabulary, and the combined approach significantly improves the overall naturalness and clarity of the generated speech.
Abstract
Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.