BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

📅 2025-06-11
🤖 AI Summary
GAN-based vocoders often model periodicity inaccurately and capture long-range dependencies poorly when generating long audio. This paper presents BemaGANv2, a GAN vocoder designed for high-fidelity, long-term audio generation. Its main changes are: (1) the generator's ResBlocks are replaced by the Anti-aliased Multi-Periodicity composition (AMP) module, which applies the Snake activation internally to better model periodic structure; and (2) a Multi-Envelope Discriminator (MED) is combined with the Multi-Resolution Discriminator (MRD) to improve modeling of long-range dependencies. Several discriminator configurations are compared using objective metrics (FAD, SSIM, PLCC, MCD) and subjective listening scores (MOS, SMOS). The code and pre-trained models are publicly available.

📝 Abstract
This paper presents a tutorial-style survey and implementation guide for BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity, long-term audio generation. Building on the original BemaGAN architecture, BemaGANv2 replaces the traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), an architecture we originally proposed, to extract rich temporal envelope features that are crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate several discriminator configurations, including MSD + MED, MSD + MRD, and MPD + MED + MRD (where MSD and MPD denote the Multi-Scale and Multi-Period Discriminators), using objective metrics (FAD, SSIM, PLCC, MCD) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
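The abstract notes that the AMP module applies the Snake activation to model periodic structure. As a rough illustration, Snake is commonly written as snake(x) = x + (1/alpha) * sin^2(alpha * x), where alpha sets the frequency of the periodic term. The sketch below is a scalar, standard-library version for intuition only; the function names and the fixed alpha are assumptions made here, and the repository's implementation (which likely operates on tensors with a trainable alpha) may differ.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    alpha is a fixed positive constant in this sketch (an assumption);
    vocoder implementations usually treat it as a learnable parameter.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + (math.sin(alpha * x) ** 2) / alpha


def snake_signal(samples, alpha: float = 1.0):
    """Apply Snake element-wise to a sequence of samples."""
    return [snake(s, alpha) for s in samples]


if __name__ == "__main__":
    # The periodic term is non-negative, so the output never drops below x.
    for value in (-1.0, 0.0, 0.5, math.pi / 2):
        print(value, snake(value, alpha=1.0))
```

Because the sin^2 term is bounded between 0 and 1/alpha, the activation stays close to the identity while adding a periodic ripple, which is the property that makes it attractive for waveform generation.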
Problem

Research questions and friction points this paper is trying to address.

Improving GAN-based vocoders for high-fidelity audio generation
Modeling periodic structure in long-term audio with the AMP module
Evaluating discriminator configurations for better periodicity detection
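One of the objective metrics named in the abstract, PLCC, is the Pearson linear correlation coefficient. The sketch below computes it for two equal-length sequences using only the standard library. What the paper actually correlates (for example, per-frame features of reference and generated audio) is not stated on this page, so the inputs here are placeholder numbers and the function name is an assumption.

```python
import math


def plcc(x, y) -> float:
    """Pearson linear correlation coefficient between two sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    reference = [0.1, 0.4, 0.35, 0.8]   # placeholder values, not paper data
    generated = [0.12, 0.38, 0.40, 0.75]
    print(plcc(reference, generated))
```

Values near 1 indicate that the two sequences rise and fall together linearly; the metric is insensitive to overall scale and offset, so it is usually reported alongside distance-based measures such as MCD.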
Innovation

Methods, ideas, or system contributions that make the work stand out.

AMP module replaces ResBlocks in generator
MED discriminator extracts temporal envelope features
Combines MED and MRD for long-range dependencies
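To give a feel for what "temporal envelope features" can mean, the sketch below extracts a simple amplitude envelope at several time scales using a sliding maximum of the absolute signal. This is an illustrative stand-in, not the MED's actual feature extraction: the function names, the sliding-max method, and the window sizes are all assumptions made for this example.

```python
def upper_envelope(samples, window: int):
    """Sliding maximum of |x| over a centred window.

    The effective window width is 2 * (window // 2) + 1 samples,
    truncated at the signal boundaries.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    n = len(samples)
    return [
        max(abs(v) for v in samples[max(0, i - half): min(n, i + half + 1)])
        for i in range(n)
    ]


def multi_envelope(samples, windows=(3, 5)):
    """Envelopes at several window sizes (hypothetical scales)."""
    return {w: upper_envelope(samples, w) for w in windows}


if __name__ == "__main__":
    signal = [0.0, 1.0, -0.5, 0.2, 0.0]
    for width, env in multi_envelope(signal).items():
        print(width, env)
```

Wider windows smooth over individual oscillations and expose slower amplitude trends, which is one intuition for why looking at several envelopes at once could help a discriminator judge long-term structure.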
Authors

Taesoo Park, Department of Electronic Engineering, Kwangwoon University, Seoul, South Korea
Mungwi Jeong, Department of Electronic Engineering, Kwangwoon University, Seoul, South Korea
Mingyu Park, Department of Electronic Engineering, Kwangwoon University, Seoul, South Korea
Narae Kim, Department of Electronic Engineering, Kwangwoon University, Seoul, South Korea
Junyoung Kim, Department of Electronic Engineering, Kwangwoon University, Seoul, South Korea
Mujung Kim, Department of Electronic Engineering, Kwangwoon University, Seoul, South Korea
Jisang Yoo, SungKyunKwan University, Department of Intelligent Robotics (Computer Vision, Artificial Intelligence)
Hoyun Lee, Ewha Womans University College of Medicine, Seoul, South Korea
Sanghoon Kim, School of Medicine, Kyung Hee University, Seoul, South Korea
Soonchul Kwon, Graduate School of Smart Convergence, Kwangwoon University, Seoul, South Korea