Aliasing-Free Neural Audio Synthesis

πŸ“… 2025-12-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing neural vocoders and audio codecs suffer from limited synthesis fidelity due to aliasing artifacts: nonlinear activations introduce harmonics beyond the Nyquist frequency, while transposed convolutions (ConvTranspose) cause mirror aliasing and tonal, constant-frequency distortions. This work establishes an anti-aliasing waveform modeling paradigm grounded in signal processing principles. We propose two key innovations: (1) an oversampling scheme combined with anti-derivative anti-aliasing of the activation function to suppress fold-back aliasing; and (2) replacement of ConvTranspose with learnable resampling layers to eliminate mirror aliasing and fixed-frequency ringing. Based on these, we design Pupu-Vocoder and Pupu-Codec, lightweight time-domain architectures. Evaluated on singing voice, music, and general audio synthesis, they consistently surpass state-of-the-art methods, while remaining highly competitive on speech. We publicly release pre-trained models and a dedicated anti-aliasing evaluation benchmark.
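The fold-back mechanism and the oversampling-based fix described above can be illustrated numerically. The sketch below is not the paper's module: it assumes `tanh` as a stand-in nonlinearity, uses SciPy's polyphase resampler for the band-limited rate changes, and picks a test tone whose third harmonic lies above Nyquist so that the naive activation folds it back into the audible band.

```python
import numpy as np
from scipy.signal import resample_poly

def antialiased_tanh(x, oversample=4):
    """Apply a nonlinearity at a higher internal rate to reduce fold-back.

    Oversampling raises the internal Nyquist so the nonlinearity's
    harmonics land below it; the band-limited downsampler then removes
    most of the energy that would otherwise fold back. tanh is only a
    stand-in for the activation used in the paper.
    """
    x_up = resample_poly(x, oversample, 1)      # band-limited upsampling
    y_up = np.tanh(5.0 * x_up)                  # strongly nonlinear waveshaper
    return resample_poly(y_up, 1, oversample)   # lowpass + decimate

# A tone whose 3rd harmonic (3 * f0 > fs/2) folds back under naive tanh.
fs, n, k = 16000, 4096, 922
f0 = k * fs / n                                 # exact FFT bin, ~3601.6 Hz
t = np.arange(n) / fs
x = 0.9 * np.sin(2 * np.pi * f0 * t)

naive = np.tanh(5.0 * x)                        # aliased: 3*f0 folds to fs - 3*f0
aa = antialiased_tanh(x)                        # alias energy strongly attenuated
```

Comparing windowed spectra of `naive` and `aa` at the fold-back bin (`n - 3*k`) shows the aliased third harmonic clearly in the naive output and heavily suppressed in the oversampled path.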

πŸ“ Abstract
Neural vocoders and codecs reconstruct waveforms from acoustic representations, so their design directly impacts audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance. Still, despite their success in producing perceptually natural sound, their synthesis fidelity remains limited by aliasing artifacts introduced by inadequately designed architectures. In particular, an unconstrained nonlinear activation generates an infinite number of harmonics that exceed the Nyquist frequency, producing "folded-back" aliasing artifacts. The widely used upsampling layer, ConvTranspose, copies mirrored low-frequency content into the empty high-frequency region, producing "mirrored" aliasing artifacts. Meanwhile, the combination of its inherent periodicity and the mirrored DC bias also introduces a "tonal artifact," i.e., constant-frequency ringing. This paper addresses these issues from a signal processing perspective. Specifically, we apply oversampling and anti-derivative anti-aliasing to the activation function to obtain its anti-aliased form, and replace the problematic ConvTranspose layer with resampling to avoid the tonal artifact and eliminate aliased components. Based on the proposed anti-aliased modules, we introduce Pupu-Vocoder and Pupu-Codec, and release high-quality pre-trained checkpoints to facilitate audio generation research. We build a test-signal benchmark to illustrate the effectiveness of the anti-aliased modules, and conduct experiments on speech, singing voice, music, and general audio to validate the proposed models. Experimental results confirm that our lightweight Pupu-Vocoder and Pupu-Codec outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech.
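The "mirrored" artifact the abstract attributes to ConvTranspose can be reproduced with plain NumPy: a strided transposed convolution internally interleaves zeros between input samples before filtering, and zero insertion creates a spectral image of every input component. The sketch below contrasts that with band-limited polyphase resampling; it is an illustration of the signal-processing argument, not the paper's learnable resampling layer.

```python
import numpy as np
from scipy.signal import resample_poly

fs, n, up = 8000, 2048, 2
k = 256                          # exact FFT bin for a clean spectrum
f0 = k * fs / n                  # 1000 Hz tone
x = np.sin(2 * np.pi * f0 * np.arange(n) / fs)

# Stride-2 ConvTranspose interleaves zeros before filtering; the
# zero-stuffed signal carries a mirror image of the tone at fs - f0.
zero_stuffed = np.zeros(n * up)
zero_stuffed[::up] = x

# Band-limited resampling suppresses the image with a fixed lowpass
# instead of relying on the learned filter to remove it.
resampled = resample_poly(x, up, 1)

w = np.hanning(n * up)
S_zs = np.abs(np.fft.rfft(w * zero_stuffed))
S_rs = np.abs(np.fft.rfft(w * resampled))
tone_bin = k                     # 1000 Hz at the doubled rate
mirror_bin = n * up // 2 - k     # 7000 Hz image

# For the zero-stuffed signal the image is as strong as the tone
# (close to 0 dB); polyphase resampling pushes it into the filter's
# stopband, tens of dB down.
image_ratio_db = 20 * np.log10(S_rs[mirror_bin] / (S_zs[mirror_bin] + 1e-12))
```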
Problem

Research questions and friction points this paper is trying to address.

Address aliasing artifacts in neural audio synthesis models.
Eliminate mirrored and tonal artifacts from upsampling layers.
Improve synthesis fidelity across speech, singing, and music.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anti-aliased activation via oversampling and anti-derivative anti-aliasing
Replacing ConvTranspose with resampling to avoid artifacts
Introducing lightweight Pupu-Vocoder and Pupu-Codec models
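The anti-derivative anti-aliasing (ADAA) idea named in the first innovation can be sketched in its standard first-order form: the pointwise activation f(x[n]) is replaced by the divided difference of its antiderivative F, which is equivalent to averaging the continuous-time output over one sample period and attenuates harmonics that would fold past Nyquist. The paper's exact activation and order are not specified here, so `tanh` (with antiderivative log cosh) is used as an assumed stand-in.

```python
import numpy as np

def tanh_adaa(x, eps=1e-6):
    """First-order antiderivative anti-aliasing (ADAA) for tanh.

    y[n] = (F(x[n]) - F(x[n-1])) / (x[n] - x[n-1]),  F(x) = log(cosh(x)),
    with a midpoint fallback when consecutive samples nearly coincide.
    """
    # Numerically stable log(cosh(x)) = logaddexp(x, -x) - log(2).
    F = np.logaddexp(x, -x) - np.log(2.0)
    x_prev = np.roll(x, 1); x_prev[0] = x[0]
    F_prev = np.roll(F, 1); F_prev[0] = F[0]
    dx = x - x_prev
    # Midpoint tanh where the divided difference would be ill-conditioned.
    out = np.tanh(0.5 * (x + x_prev))
    safe = np.abs(dx) > eps
    out[safe] = (F[safe] - F_prev[safe]) / dx[safe]
    return out
```

For slowly varying inputs the ADAA output tracks plain `tanh` closely; the benefit appears on fast-moving signals, where the divided difference smooths exactly the high-order harmonics that alias.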
πŸ”Ž Similar Papers
No similar papers found.