FlowW2N: Whispered-to-Normal Speech Conversion via Flow-Matching

๐Ÿ“… 2026-03-04
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the challenge of converting whispered speech to normal speech in real-world scenarios, where utterances are temporally misaligned and paired training data are unavailable. The authors propose FlowW2N, a method based on conditional flow matching that leverages only synthetic, time-aligned whisperโ€“normal speech pairs for training. By incorporating domain-invariant high-level embeddings from an automatic speech recognition (ASR) model as conditioning signals, FlowW2N significantly enhances generalization to real whispered speech. Through systematic evaluation, the authors identify ASR embedding layers that exhibit strong cross-domain invariance and rich content information, enabling FlowW2N to achieve state-of-the-art performance on the CHAINS and wTIMIT datasets, with relative word error rate reductions of 26%โ€“46%. The model requires only 10 inference steps and operates without any real paired training data.
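To make the core training idea concrete: conditional flow matching regresses a network onto the velocity of a straight-line path between a time-aligned whisper-normal feature pair. The sketch below is an illustrative toy, not the paper's code; the feature shapes and the name `cfm_training_target` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_target(x_whisper, x_normal, t):
    """Linear-interpolation path used in conditional flow matching:
    x_t = (1 - t) * x0 + t * x1, with constant target velocity x1 - x0."""
    x_t = (1.0 - t) * x_whisper + t * x_normal
    v_target = x_normal - x_whisper
    return x_t, v_target

# Toy "batch": mel-like feature frames (frames x dims) for one synthetic,
# time-aligned whisper-normal pair (shapes are hypothetical).
x0 = rng.normal(size=(100, 80))   # whispered-speech features
x1 = rng.normal(size=(100, 80))   # normal-speech features
t = rng.uniform()                 # flow time sampled from [0, 1]

x_t, v = cfm_training_target(x0, x1, t)
# A network v_theta(x_t, t, asr_embedding) would be regressed onto v with an
# MSE loss; the domain-invariant ASR embedding is the conditioning signal.
```

Because the target velocity is constant along the path, a perfectly trained field transports whisper features to normal features in very few integration steps, which is consistent with the 10-step inference budget reported here.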

๐Ÿ“ Abstract
Whispered-to-normal (W2N) speech conversion aims to reconstruct the missing phonation from whispered input while preserving content and speaker identity. The task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion that balances content informativeness with cross-domain invariance. Our method achieves state-of-the-art intelligibility on the CHAINS and wTIMIT datasets, reducing word error rate by 26-46% relative to prior work while using only 10 steps at inference and requiring no real paired data.
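The abstract's "only 10 steps at inference" refers to numerically integrating the learned velocity field from the whispered input at t=0 to the converted output at t=1. A minimal sketch of such a sampler, assuming a fixed-step Euler scheme and an idealized straight-line field in place of the learned network (both assumptions, not details from the paper):

```python
import numpy as np

def euler_sample(v_field, x0, n_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler,
    mirroring the 10-step inference budget mentioned in the abstract."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(100, 80))   # whispered features (toy shapes)
x1 = rng.normal(size=(100, 80))   # target normal-speech features

# Idealized field pointing straight at the target; a trained v_theta,
# conditioned on ASR embeddings, only approximates this behavior.
ideal_v = lambda x, t: (x1 - x) / (1.0 - t)
x_final = euler_sample(ideal_v, x0)
```

Under this idealized field, 10 Euler steps land exactly on the target; with a learned field, the step count trades accuracy against inference cost.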
Problem

Research questions and friction points this paper is trying to address.

Whispered-to-normal speech conversion
phonation reconstruction
temporal misalignment
unpaired data
speaker identity preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching
whispered-to-normal speech conversion
domain-invariant features
ASR embeddings
synthetic data training
๐Ÿ”Ž Similar Papers
No similar papers found.
Fabian Ritter-Gutierrez
PhD Student Nanyang Technological University
Automatic Speech Recognition, Neural Architecture Search, Self-Supervised Learning
Md Asif Jalal
Machine Learning researcher
Machine Learning, ASR, Speech Processing, Affective Computing, Generative AI
Pablo Peso Parada
AI Researcher - Samsung Research UK
signal processing, machine learning, open source hardware, audio, speech
Karthikeyan Saravanan
Samsung Electronics R&D Institute UK (SRUK), London, United Kingdom
Yusun Shul
Samsung Electronics, Mobile eXperience Business, Suwon, Republic of Korea
Minseung Kim
Samsung Electronics, Mobile eXperience Business, Suwon, Republic of Korea
Gun-Woo Lee
Samsung Electronics, Mobile eXperience Business, Suwon, Republic of Korea
Han-Gil Moon
Samsung Electronics, Mobile eXperience Business, Suwon, Republic of Korea