🤖 AI Summary
Time-scale modification (TSM) of speech must preserve pitch and perceptual naturalness while altering tempo, yet conventional methods such as WSOLA often introduce artifacts under non-stationary conditions or extreme scaling factors. To address this, we propose a neural TSM framework built on FiLM-based conditional modulation: continuous time-scale factors are injected as learnable conditioning signals into a deep architecture, enabling fine-grained and robust duration control. Crucially, we introduce a WSOLA-inspired reconstruction loss as weak supervision, reducing the reliance on high-fidelity paired training data. The framework combines diverse feature encoders (STFT, WavLM, Whisper, and EnCodec) with a HiFiGAN-based vocoder, improving generalization across scaling factors. Experiments demonstrate that our method suppresses distortion and robotic artifacts under both extreme acceleration and extreme deceleration, achieving better speech naturalness and perceptual consistency than state-of-the-art baselines.
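For context, the WSOLA baseline mentioned above stretches or compresses audio by overlap-adding windowed segments, searching a small tolerance region in the input for the segment that best continues the output so far. A minimal NumPy sketch is below; the frame, hop, and tolerance values are illustrative defaults, not parameters from the paper:

```python
import numpy as np

def wsola(x, speed, frame=512, hop=256, tol=128):
    """Minimal WSOLA time-scale modification sketch.

    x:     mono signal (1-D float array)
    speed: time-scale factor (>1 shortens the output, <1 lengthens it)

    For each output frame, searches +/- tol samples around the nominal
    input position for the segment best matching the natural continuation
    of the previously placed segment, then overlap-adds with a Hann window.
    """
    win = np.hanning(frame)
    n_out = int(len(x) / speed)
    out = np.zeros(n_out + frame)
    norm = np.zeros(n_out + frame)
    prev = None  # expected continuation of the last placed segment
    for out_pos in range(0, n_out, hop):
        center = min(int(out_pos * speed), len(x) - frame)
        if prev is not None and len(prev) == frame:
            lo = max(0, center - tol)
            hi = max(lo + 1, min(len(x) - frame, center + tol))
            # pick the candidate start with maximal cross-correlation
            scores = [np.dot(x[s:s + frame], prev) for s in range(lo, hi)]
            best = lo + int(np.argmax(scores))
        else:
            best = max(0, center)
        out[out_pos:out_pos + frame] += x[best:best + frame] * win
        norm[out_pos:out_pos + frame] += win
        prev = x[best + hop:best + hop + frame]
    norm[norm < 1e-8] = 1.0  # avoid division by zero at the edges
    return out[:n_out] / norm[:n_out]
```

Because segments are copied rather than resampled, the local waveform shape (and hence pitch) is preserved while duration changes; the alignment search is what breaks down under non-stationary or extreme conditions, motivating the neural approach.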
📝 Abstract
Time-Scale Modification (TSM) of speech aims to alter the playback rate of audio without changing its pitch. While classical methods such as Waveform Similarity-based Overlap-Add (WSOLA) provide strong baselines, they often introduce artifacts under non-stationary or extreme stretching conditions. We propose STSM-FILM, a fully neural architecture that incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor. By supervising the network with WSOLA-generated outputs, STSM-FILM learns to mimic WSOLA's alignment and synthesis behavior while benefiting from representations learned through deep learning. We explore four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, and demonstrate that STSM-FILM produces perceptually consistent outputs across a wide range of time-scaling factors. Overall, our results demonstrate the potential of FiLM-based conditioning to improve the generalization and flexibility of neural TSM models.
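FiLM conditioning, as used here, scales and shifts each feature channel with coefficients computed from the conditioning input (the continuous speed factor). The sketch below shows the core operation with a single linear map from speed to per-channel gamma/beta; the actual conditioning network in STSM-FILM is deeper, so the parameterization here is an illustrative assumption:

```python
import numpy as np

def film_modulate(features, speed, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM conditioning: per-channel scale (gamma) and shift (beta)
    derived from a continuous speed factor.

    features: (channels, frames) feature map from an encoder
    speed:    scalar time-scale factor, e.g. 0.5 or 2.0
    w_*, b_*: parameters of tiny linear maps speed -> gamma, beta
              (illustrative stand-ins for a learned conditioning network)
    """
    gamma = w_gamma * speed + b_gamma  # shape (channels,)
    beta = w_beta * speed + b_beta     # shape (channels,)
    # broadcast over the frame axis: each channel gets its own affine map
    return gamma[:, None] * features + beta[:, None]
```

Initializing the conditioning maps so that gamma = 1 and beta = 0 makes the layer an identity at the start of training, a common choice that lets the modulation be learned gradually.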