UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck of conventional two-stage audio super-resolution (SR) methods, which rely on pre-trained neural vocoders. We propose a vocoder-free, end-to-end flow-matching generative framework. Methodologically, we directly model the conditional distribution in the complex short-time Fourier transform (STFT) spectral domain, employing flow matching to learn the mapping from low-frequency to full-band spectra, and reconstruct high-fidelity waveforms via differentiable inverse STFT (iSTFT). The framework supports arbitrary integer upsampling factors with unified training and inference. Our key contribution is the first application of flow matching to vocoder-free audio SR, eliminating vocoder-induced distortions and cascaded errors. Experiments demonstrate state-of-the-art performance on both speech and general audio datasets, enabling stable generation of high-quality 48 kHz waveforms while improving training efficiency and cross-dataset generalization.
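The summary above describes learning a flow from Gaussian noise to full-band complex spectra, conditioned on the low-frequency input. A minimal sketch of one conditional flow-matching training step under common conventions (linear interpolation path, velocity target), with a toy stand-in for the velocity network; the function names and placeholder dynamics are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a velocity-field network; the real model is a neural net.
def velocity_model(x_t, t, lowband):
    # Hypothetical placeholder dynamics conditioned on the low-band spectrum.
    return lowband - x_t

def flow_matching_loss(x1, lowband, model):
    """One conditional flow-matching step toward a full-band complex
    spectrum, represented as stacked real/imag channels."""
    x0 = rng.standard_normal(x1.shape)   # sample from the Gaussian prior
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1         # linear (rectified-flow) path
    target_v = x1 - x0                   # constant velocity along that path
    pred_v = model(xt, t, lowband)
    return np.mean((pred_v - target_v) ** 2)

# Toy data: full-band target spectrum and its low-band conditioning signal.
x1 = rng.standard_normal((2, 257, 100))  # [real/imag, freq bins, frames]
lowband = np.where(np.arange(257)[None, :, None] < 64, x1, 0.0)
loss = flow_matching_loss(x1, lowband, velocity_model)
```

At inference, the learned velocity field would be integrated from noise to a full-band spectrum with an ODE solver; only the training objective is sketched here.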

📝 Abstract
In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
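The abstract's central design choice is replacing the neural vocoder with a plain inverse STFT. A minimal sketch of that reconstruction path using `scipy.signal` (the STFT parameters here are arbitrary assumptions; in the paper the complex spectrum comes from the generative model rather than a forward transform):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 48_000
t = np.arange(fs) / fs
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # 1 s test tone at 48 kHz

# Forward STFT: stands in for the model's predicted full-band complex spectrum.
_, _, Zxx = stft(wave, fs=fs, nperseg=1024)

# Vocoder-free reconstruction: inverse STFT maps the complex spectrum
# straight back to a waveform, with no learned synthesis stage in between.
_, recon = istft(Zxx, fs=fs, nperseg=1024)
recon = recon[: wave.size]

err = np.max(np.abs(recon - wave))
```

Because the iSTFT is differentiable, gradients can flow from waveform-domain losses back into the spectral generative model, which is what makes end-to-end optimization possible here.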
Problem

Research questions and friction points this paper is trying to address.

Two-stage pipelines depend on a pre-trained neural vocoder to synthesize waveforms from predicted mel-spectrograms
Final audio quality is fundamentally constrained by vocoder performance, introducing cascaded errors
Conventional approaches lack a unified model that handles diverse upsampling factors and audio types
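One way a single model can be trained and run across many upsampling factors is to condition it on a low-band spectrum whose cutoff depends on the factor. A minimal sketch of that idea (the helper and its cutoff rule are hypothetical, not the paper's exact conditioning scheme):

```python
import numpy as np

def lowband_condition(spec, factor):
    """Zero out frequency bins above the source Nyquist for an integer
    upsampling factor. Hypothetical helper: keeping roughly n_bins/factor
    bins corresponds to an input sampled at fs/factor."""
    n_bins = spec.shape[0]
    cutoff = int(np.ceil(n_bins / factor))  # bins the low-rate input retains
    cond = spec.copy()
    cond[cutoff:] = 0.0                     # discard high-frequency content
    return cond

# Toy spectrum frames; 513 bins ~ a 1024-point FFT.
spec = np.random.default_rng(1).standard_normal((513, 80))
conds = {r: lowband_condition(spec, r) for r in (2, 3, 4, 6)}
```

Under this scheme the same network sees conditions of a uniform shape regardless of ratio, so one set of weights covers e.g. 24 kHz, 16 kHz, 12 kHz, and 8 kHz inputs upsampled to 48 kHz.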
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vocoder-free flow matching for audio super-resolution
Direct waveform reconstruction via inverse STFT
End-to-end optimization without separate vocoder dependency
Woongjib Choi
Dept. of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea
Sangmin Lee
Dept. of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea
Hyungseob Lim
Dept. of Electrical & Electronic Engineering, Yonsei University, Seoul, South Korea
Hong-Goo Kang
Yonsei University