STSM-FiLM: A FiLM-Conditioned Neural Architecture for Time-Scale Modification of Speech

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Time-scale modification (TSM) of speech requires preserving pitch and perceptual naturalness during tempo alteration, yet conventional methods—e.g., WSOLA—often introduce artifacts under non-stationary conditions or extreme scaling. To address this, we propose a neural TSM framework leveraging FiLM-based conditional modulation, where continuous time-scale factors are injected as learnable conditioning signals into a deep architecture, enabling fine-grained and robust duration control. Crucially, we introduce a WSOLA-inspired reconstruction loss as weak supervision to reduce reliance on high-fidelity paired training data. The framework integrates diverse feature encoders—including STFT, WavLM, Whisper, and EnCodec—with a HiFiGAN-based vocoder, significantly enhancing cross-scale generalization. Experiments demonstrate that our method effectively suppresses distortion and robotic artifacts in both extreme acceleration and deceleration scenarios, achieving superior speech naturalness and perceptual consistency compared to state-of-the-art baselines.
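The core conditioning mechanism described above, FiLM, applies a per-channel scale and shift derived from the time-scale factor. The sketch below illustrates the idea with NumPy, using random weights as stand-ins for the learned FiLM generator; the layer names and shapes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical FiLM generator: maps a scalar speed factor to
# per-channel scale (gamma) and shift (beta) parameters.
# The weights here are random stand-ins for learned parameters.
n_channels = 8
W = rng.normal(size=(2 * n_channels, 1))
b = rng.normal(size=(2 * n_channels,))

def film_params(speed):
    out = W @ np.array([float(speed)]) + b
    return out[:n_channels], out[n_channels:]  # gamma, beta

def film(features, speed):
    # Feature-wise linear modulation: scale and shift each channel
    # according to the conditioning speed factor.
    gamma, beta = film_params(speed)
    return gamma[:, None] * features + beta[:, None]

features = rng.normal(size=(n_channels, 16))  # (channels, time)
out = film(features, speed=1.5)
print(out.shape)  # (8, 16)
```

Because the speed factor is continuous, the same layer can modulate intermediate features for any stretch ratio, rather than being restricted to a discrete set of trained scales.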

📝 Abstract
Time-Scale Modification (TSM) of speech aims to alter the playback rate of audio without changing its pitch. While classical methods like Waveform Similarity-based Overlap-Add (WSOLA) provide strong baselines, they often introduce artifacts under non-stationary or extreme stretching conditions. We propose STSM-FiLM, a fully neural architecture that incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor. By supervising the network with WSOLA-generated outputs, STSM-FiLM learns to mimic alignment and synthesis behaviors while benefiting from representations learned through deep learning. We explore four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, and demonstrate that STSM-FiLM produces perceptually consistent outputs across a wide range of time-scaling factors. Overall, our results demonstrate the potential of FiLM-based conditioning to improve the generalization and flexibility of neural TSM models.
Problem

Research questions and friction points this paper is trying to address.

Altering speech playback rate without pitch change
Reducing artifacts in non-stationary stretching conditions
Improving neural TSM generalization using FiLM conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

FiLM-conditioned neural architecture for speech time-scaling
Supervised learning using WSOLA-generated output alignments
Four encoder-decoder variants tested for consistent time-scaling
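The weak-supervision idea listed above can be made concrete with a minimal sketch: the WSOLA-stretched waveform serves as a pseudo-target, and the model output is penalized by mean absolute error. This is an illustrative assumption about the loss form, not the paper's exact objective; `wsola_l1_loss` and the placeholder signals are hypothetical.

```python
import numpy as np

def wsola_l1_loss(pred, target):
    # WSOLA-inspired weak supervision (a sketch): compare the network
    # output against the WSOLA pseudo-target sample by sample.
    n = min(len(pred), len(target))  # align lengths before comparing
    return float(np.mean(np.abs(pred[:n] - target[:n])))

# Placeholder signals standing in for a model output and a WSOLA target.
t = np.linspace(0.0, 1.0, 8000)
target = np.sin(2 * np.pi * 220 * t)  # pseudo-target from WSOLA
pred = target + 0.01 * np.random.default_rng(1).normal(size=t.shape)

loss = wsola_l1_loss(pred, target)
```

The appeal of this setup is that WSOLA can generate targets for any input and speed factor on the fly, so no high-fidelity paired training data is required.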
Dyah A. M. G. Wisnu
Research Center for Information Technology Innovation, Academia Sinica, Taiwan
Ryandhimas E. Zezario
Research Center for Information Technology Innovation, Academia Sinica, Taiwan
Stefano Rini
Institute of Communications Engineering, National Yang Ming Chiao Tung University, Taiwan
Fo-Rui Li
Dept. of Biomedical Science and Engineering, National Central University, Taoyuan, Taiwan
Yan-Tsung Peng
National Chengchi University
Hsin-Min Wang
Research Fellow/Professor, Institute of Information Science, Academia Sinica
Spoken Language Processing, Natural Language Processing, Multimedia Information Retrieval, Machine Learning
Yu Tsao
Research Fellow (Professor), Deputy Director, CITI, Academia Sinica
Assistive Oral Communication Technologies, Speech Enhancement, Voice Conversion, Speech Assessment