WaveTrainerFit: Neural Vocoder with Trainable Prior and Fixed-Point Iteration towards High-Quality Speech Generation from SSL Features

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in generating high-fidelity speech waveforms from self-supervised learning (SSL) features, including modeling complexity, multi-step inference, and insufficient speaker similarity. Building upon the WaveFit vocoder, which integrates a diffusion model with a generative adversarial network, the proposed method introduces a trainable prior to replace the conventional Gaussian-noise initialization. It additionally incorporates reference-aware gain adjustment via an energy-matching constraint on the prior to enhance output quality. The approach significantly reduces the number of inference iterations required, synthesizing natural, high-fidelity speech in fewer steps while substantially improving speaker similarity. Furthermore, the method is robust to the depth at which SSL features are extracted, maintaining consistent performance across different representation granularities.

📝 Abstract
We propose WaveTrainerFit, a neural vocoder that performs high-quality waveform generation from data-driven features such as SSL features. WaveTrainerFit builds upon the WaveFit vocoder, which integrates a diffusion model and a generative adversarial network. Furthermore, the proposed method incorporates the following key improvements: 1. Trainable priors are introduced so that the inference process starts from noise close to the target speech instead of Gaussian noise. 2. Reference-aware gain adjustment is performed by imposing constraints on the trainable prior to match the speech energy. These improvements are expected to reduce the complexity of waveform modeling from data-driven features, enabling high-quality waveform generation with fewer inference steps. Through experiments, we show that WaveTrainerFit can generate highly natural waveforms with improved speaker similarity from data-driven features, while requiring fewer iterations than WaveFit. Moreover, we show that the proposed method works robustly with respect to the depth at which SSL features are extracted. Code and pre-trained models are available from https://github.com/line/WaveTrainerFit.
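The inference loop described in the abstract (fixed-point refinement starting from a trainable prior, with gain renormalization at each step) can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: `energy_match` is a hypothetical form of the reference-aware gain adjustment, the `denoiser` is an oracle stand-in rather than a learned network, and the update rule only mimics the spirit of WaveFit's iteration, not its exact formulation.

```python
import numpy as np

def energy_match(signal, reference, eps=1e-8):
    """Scale `signal` so its energy matches the reference's energy
    (hypothetical stand-in for reference-aware gain adjustment)."""
    gain = np.sqrt((reference ** 2).sum() / ((signal ** 2).sum() + eps))
    return gain * signal

def fixed_point_inference(prior, denoiser, reference, n_iters=3):
    """WaveFit-style fixed-point iteration: repeatedly subtract the
    estimated noise component, then renormalize the output gain."""
    y = energy_match(prior, reference)
    for _ in range(n_iters):
        y = y - denoiser(y)             # remove estimated noise
        y = energy_match(y, reference)  # keep output energy consistent
    return y

# Toy demo: the "trainable prior" is noise close to the target waveform,
# and the stand-in denoiser shrinks the residual toward the target.
target = np.sin(np.linspace(0, 4 * np.pi, 256))
rng = np.random.default_rng(0)
prior = target + 0.3 * rng.standard_normal(256)
denoiser = lambda y: 0.5 * (y - target)  # oracle, for illustration only
out = fixed_point_inference(prior, denoiser, target, n_iters=3)
```

Because the prior already lies near the target, a few iterations suffice to shrink the residual, which is the intuition behind the reduced step count claimed for the method.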
Problem

Research questions and friction points this paper is trying to address.

neural vocoder
SSL features
high-quality speech generation
waveform modeling
speaker similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

trainable prior
fixed-point iteration
neural vocoder
SSL features
waveform generation