BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

📅 2025-06-11

🤖 AI Summary

To address inaccurate periodicity modeling and insufficient long-range dependency capture in GAN-based vocoders for long-term audio generation, this paper presents BemaGANv2, a GAN vocoder designed for high-fidelity generation of long audio waveforms. Its core changes are: (1) the Anti-aliased Multi-Periodicity composition (AMP) module, which replaces the generator's ResBlocks and uses the Snake activation to model periodic structure such as the fundamental frequency and its harmonics; and (2) a Multi-Envelope Discriminator (MED) deployed jointly with the Multi-Resolution Discriminator (MRD) to improve modeling of long-term structure. The evaluation combines objective metrics (FAD, SSIM, PLCC, MCD) with subjective scores (MOS, SMOS), and BemaGANv2 is reported to outperform the compared GAN vocoders on both. The code and pre-trained models are open-sourced.

📝 Abstract

This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we originally proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD, using objective metrics (FAD, SSIM, PLCC, MCD) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
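The Snake activation mentioned in the abstract is commonly written as snake(x) = x + (1/α)·sin²(αx). The following is a minimal sketch of that formula in plain Python, not the repository's implementation. The function names `snake` and `snake_signal` are hypothetical, and the fixed scalar `alpha` is an assumption made for illustration; neural vocoders that use Snake typically learn α per channel.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    `alpha` sets the frequency of the periodic term. A fixed scalar is an
    assumption for this sketch; it is usually a learned parameter.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha


def snake_signal(samples, alpha: float = 1.0):
    """Apply the Snake activation element-wise to a sequence of samples."""
    return [snake(s, alpha) for s in samples]


if __name__ == "__main__":
    # Small demonstration on a few input values.
    for value in (-1.0, 0.0, 0.5, 1.0):
        print(value, snake(value, alpha=2.0))
```

Because the sin² term is never negative, the output is always at least the input, so the function keeps an identity-like trend while adding a periodic component. That property is the usual motivation for using it on periodic signals such as audio.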

Problem

Research questions and friction points this paper is trying to address.

Improving GAN-based vocoders for high-fidelity audio generation

Modeling periodic structure in long-term audio generation with the AMP module

Evaluating discriminator configurations for better periodicity detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

AMP module replaces ResBlocks in generator

MED discriminator extracts temporal envelope features

Combines MED and MRD for long-range dependencies
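To make the idea of "temporal envelope features" concrete, the sketch below extracts upper and lower envelopes of a waveform at several window sizes. This summary does not specify how MED computes its envelopes, so the sliding maximum/minimum used here is an assumption chosen for illustration, and the names `sliding_envelope` and `multi_envelope` and the default window sizes are hypothetical.

```python
def sliding_envelope(signal, window, upper=True):
    """Return a sliding max (upper) or min (lower) envelope of `signal`.

    Each output sample looks at a window centered on the same index and
    truncated at the signal edges. This is an illustrative approximation,
    not the envelope extraction used by MED.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pick = max if upper else min
    half = window // 2
    n = len(signal)
    return [pick(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def multi_envelope(signal, windows=(3, 9, 27)):
    """Compute (upper, lower) envelopes at several window sizes.

    Larger windows give smoother envelopes that reflect slower,
    longer-term amplitude structure.
    """
    return {
        w: (sliding_envelope(signal, w, upper=True),
            sliding_envelope(signal, w, upper=False))
        for w in windows
    }


if __name__ == "__main__":
    wave = [0, 1, 0, -1, 0, 2, 0]
    upper, lower = multi_envelope(wave, windows=(3,))[3]
    print(upper)
    print(lower)
```

In a discriminator setting, each envelope would presumably be passed to its own sub-network so that different time scales are judged separately; that wiring is not shown here.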
