AI Summary
This work addresses the challenge of converting whispered speech to normal speech in real-world scenarios, where utterances are temporally misaligned and paired training data are unavailable. The authors propose FlowW2N, a method based on conditional flow matching that leverages only synthetic, time-aligned whisper–normal speech pairs for training. By incorporating domain-invariant high-level embeddings from an automatic speech recognition (ASR) model as conditioning signals, FlowW2N significantly enhances generalization to real whispered speech. Through systematic evaluation, the authors identify ASR embedding layers that exhibit strong cross-domain invariance and rich content information, enabling FlowW2N to achieve state-of-the-art performance on the CHAINS and wTIMIT datasets, with relative word error rate reductions of 26%–46%. The model requires only 10 inference steps and operates without any real paired training data.
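The layer-selection idea above can be illustrated with a small sketch. This is not the authors' criterion, only an assumed proxy: rank ASR layers by the mean cosine similarity between embeddings of synthetic and real whispers of the same utterances, so that higher scores indicate stronger cross-domain invariance. The embedding arrays here are synthetic toy data.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_layers(synth_embs, real_embs):
    """Score each ASR layer by cross-domain invariance: the mean
    cosine similarity between synthetic-whisper and real-whisper
    embeddings of the same utterances. Higher = more invariant."""
    scores = []
    for s_layer, r_layer in zip(synth_embs, real_embs):
        sims = [cosine(s, r) for s, r in zip(s_layer, r_layer)]
        scores.append(sum(sims) / len(sims))
    return scores

# Toy example: 3 layers, 4 utterances, 8-dim embeddings.
rng = np.random.default_rng(0)
synth = [rng.normal(size=(4, 8)) for _ in range(3)]
# Make layer 1 nearly identical across domains (invariant),
# while layers 0 and 2 suffer a large domain shift.
real = [e + rng.normal(scale=s, size=e.shape)
        for e, s in zip(synth, (2.0, 0.05, 2.0))]

scores = score_layers(synth, real)
best_layer = int(np.argmax(scores))
```

In the paper this invariance score is paired with a content-informativeness measure; a practical selector would trade the two off rather than maximize invariance alone.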
Abstract
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. The task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion that balances content informativeness and cross-domain invariance. Our method achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work while using only 10 inference steps and requiring no real paired data.
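The flow matching mechanics behind this setup can be sketched in a few lines. This is a minimal, assumed illustration rather than the FlowW2N implementation: the ASR-embedding conditioning is omitted, and `velocity_fn` stands in for the learned velocity network. Training regresses the model onto the constant velocity of a straight path between a source sample and the aligned target; inference integrates that field with a small number of Euler steps (the paper reports 10).

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Conditional flow matching interpolant and regression target.
    x0: source sample (e.g. noise), x1: time-aligned target feature.
    The network is trained to predict the constant velocity x1 - x0
    at the point x_t on the straight path between them."""
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return x_t, target

def euler_sample(velocity_fn, x0, steps=10):
    """Generate by integrating dx/dt = v(x, t) from t=0 to t=1
    with `steps` forward Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy check: with the exact ground-truth velocity of one pair,
# 10 Euler steps recover x1 exactly, since the path is straight.
rng = np.random.default_rng(0)
x0 = rng.normal(size=8)
x1 = rng.normal(size=8)
x_t, target = cfm_pair(x0, x1, 0.5)
out = euler_sample(lambda x, t: x1 - x0, x0, steps=10)
```

The straight-line probability path is what makes so few inference steps viable; a trained network only approximates this field, so real outputs carry some integration and approximation error.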