🤖 AI Summary
Time-scale modification (TSM) of speech must preserve pitch and perceptual naturalness while altering tempo, yet conventional methods such as WSOLA often introduce artifacts under non-stationary conditions or extreme scaling factors. To address this, we propose a neural TSM framework built on FiLM-based conditional modulation: continuous time-scale factors are injected as learnable conditioning signals into a deep architecture, enabling fine-grained and robust duration control. Crucially, we introduce a WSOLA-inspired reconstruction loss as weak supervision, reducing the reliance on high-fidelity paired training data. The framework combines diverse feature encoders (STFT, WavLM, Whisper, and EnCodec) with a HiFiGAN-based vocoder, improving generalization across scaling factors. Experiments demonstrate that our method suppresses distortion and robotic artifacts under both extreme acceleration and extreme deceleration, achieving better speech naturalness and perceptual consistency than state-of-the-art baselines.
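For context, the WSOLA baseline mentioned above stretches or compresses audio by overlap-adding windowed segments, searching a small tolerance region in the input for the segment that best continues the output so far. A minimal NumPy sketch is below; the frame, hop, and tolerance values are illustrative defaults, not parameters from the paper:

```python
import numpy as np

def wsola(x, speed, frame=512, hop=256, tol=128):
    """Minimal WSOLA time-scale modification sketch.

    x:     mono signal (1-D float array)
    speed: time-scale factor (>1 shortens the output, <1 lengthens it)

    For each output frame, searches +/- tol samples around the nominal
    input position for the segment best matching the natural continuation
    of the previously placed segment, then overlap-adds with a Hann window.
    """
    win = np.hanning(frame)
    n_out = int(len(x) / speed)
    out = np.zeros(n_out + frame)
    norm = np.zeros(n_out + frame)
    prev = None  # expected continuation of the last placed segment
    for out_pos in range(0, n_out, hop):
        center = min(int(out_pos * speed), len(x) - frame)
        if prev is not None and len(prev) == frame:
            lo = max(0, center - tol)
            hi = max(lo + 1, min(len(x) - frame, center + tol))
            # pick the candidate start with maximal cross-correlation
            scores = [np.dot(x[s:s + frame], prev) for s in range(lo, hi)]
            best = lo + int(np.argmax(scores))
        else:
            best = max(0, center)
        out[out_pos:out_pos + frame] += x[best:best + frame] * win
        norm[out_pos:out_pos + frame] += win
        prev = x[best + hop:best + hop + frame]
    norm[norm < 1e-8] = 1.0  # avoid division by zero at the edges
    return out[:n_out] / norm[:n_out]
```

Because segments are copied rather than resampled, the local waveform shape (and hence pitch) is preserved while duration changes; the alignment search is what breaks down under non-stationary or extreme conditions, motivating the neural approach.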
📝 Abstract
Time-Scale Modification (TSM) of speech aims to alter the playback rate of audio without changing its pitch. While classical methods such as Waveform Similarity-based Overlap-Add (WSOLA) provide strong baselines, they often introduce artifacts under non-stationary or extreme stretching conditions. We propose STSM-FILM, a fully neural architecture that incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor. By supervising the network with WSOLA-generated outputs, STSM-FILM learns to mimic WSOLA's alignment and synthesis behavior while benefiting from representations learned through deep learning. We explore four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, and demonstrate that STSM-FILM produces perceptually consistent outputs across a wide range of time-scaling factors. Overall, our results demonstrate the potential of FiLM-based conditioning to improve the generalization and flexibility of neural TSM models.
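FiLM conditioning, as used here, scales and shifts each feature channel with coefficients computed from the conditioning input (the continuous speed factor). The sketch below shows the core operation with a single linear map from speed to per-channel gamma/beta; the actual conditioning network in STSM-FILM is deeper, so the parameterization here is an illustrative assumption:

```python
import numpy as np

def film_modulate(features, speed, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM conditioning: per-channel scale (gamma) and shift (beta)
    derived from a continuous speed factor.

    features: (channels, frames) feature map from an encoder
    speed:    scalar time-scale factor, e.g. 0.5 or 2.0
    w_*, b_*: parameters of tiny linear maps speed -> gamma, beta
              (illustrative stand-ins for a learned conditioning network)
    """
    gamma = w_gamma * speed + b_gamma  # shape (channels,)
    beta = w_beta * speed + b_beta     # shape (channels,)
    # broadcast over the frame axis: each channel gets its own affine map
    return gamma[:, None] * features + beta[:, None]
```

Initializing the conditioning maps so that gamma = 1 and beta = 0 makes the layer an identity at the start of training, a common choice that lets the modulation be learned gradually.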