EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 8 · Influential: 2
🤖 AI Summary
To address core challenges in text-to-audio (T2A) generation, namely low audio fidelity, high computational cost, slow sampling, and data scarcity, this paper introduces EzAudio, the first efficient diffusion Transformer model operating in the latent space of a 1D waveform VAE. Its contributions are fourfold: (1) waveform-level latent modeling that avoids spectrogram-domain artifacts and the need for a separate vocoder; (2) a lightweight diffusion Transformer architecture optimized for audio latents; (3) a data-efficient, three-stage training paradigm (unlabeled, then AI-annotated, then human-annotated data) that mitigates labeling bottlenecks; and (4) a classifier-free guidance (CFG) rescaling strategy that maintains stable audio quality while strengthening prompt alignment at large CFG scores. Experiments demonstrate that EzAudio surpasses existing open-source T2A models in both objective metrics and subjective MOS scores, achieving significantly improved audio fidelity. It reduces training cost by 40%, accelerates convergence by 2.3×, and supports fully reproducible end-to-end training.
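Contribution (1) means inference runs entirely in the latent space of a 1D waveform VAE: the diffusion Transformer denoises a 1D latent sequence, and the VAE decoder emits the waveform directly, with no 2D spectrogram inversion or separate neural vocoder. A minimal sketch of that inference flow is below; the module and method names (`text_encoder`, `dit`, `vae`, `scheduler`) are hypothetical stand-ins, not EzAudio's actual API.

```python
import torch

@torch.no_grad()
def generate_audio(prompt, text_encoder, dit, vae, scheduler, steps=50):
    """Sketch of T2A sampling in a 1D waveform-VAE latent space.

    All modules here are assumed pre-trained placeholders; the real
    EzAudio components may differ in names and signatures.
    """
    cond = text_encoder(prompt)                        # text prompt -> conditioning embedding
    # 1D latent sequence (channels x time), not a 2D spectrogram
    latent = torch.randn(1, vae.latent_channels, vae.latent_length)

    for t in scheduler.timesteps(steps):               # reverse diffusion loop
        noise_pred = dit(latent, t, cond)              # diffusion Transformer denoiser
        latent = scheduler.step(noise_pred, t, latent) # one denoising update

    return vae.decode(latent)                          # latents -> waveform, no vocoder step
```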

📝 Abstract
Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties in generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to handle these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding the complexities of handling 2D spectrogram representations and the need for an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, which improves convergence speed, training stability, and memory usage, making training easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that achieves strong prompt alignment while preserving audio quality at larger CFG scores, eliminating the need to search for the optimal CFG score to balance this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
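Point (4) of the abstract refers to rescaling the guided prediction so that large CFG scores sharpen prompt alignment without degrading fidelity. One common formulation, which matches the guided output's standard deviation to the conditional prediction's (Lin et al., 2023), is sketched below in PyTorch; EzAudio's exact rescaling rule may differ, and the function name and parameter defaults are illustrative.

```python
import torch

def cfg_rescale(cond_pred, uncond_pred, guidance_scale=5.0, rescale=0.7):
    """Classifier-free guidance with std rescaling (a sketch).

    Renormalizes the guided prediction so its per-sample standard
    deviation matches the conditional prediction's, which keeps large
    guidance scales from washing out output quality.
    """
    # Vanilla CFG: extrapolate from unconditional toward conditional
    guided = uncond_pred + guidance_scale * (cond_pred - uncond_pred)

    # Per-sample std over all non-batch dimensions
    dims = tuple(range(1, cond_pred.ndim))
    std_cond = cond_pred.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)

    # Rescale, then blend with the plain CFG output;
    # rescale=0.0 recovers vanilla CFG, rescale=1.0 is full rescaling
    rescaled = guided * (std_cond / std_guided)
    return rescale * rescaled + (1.0 - rescale) * guided
```

Because the rescaled output stays within the conditional prediction's dynamic range, the guidance scale can be raised for stronger prompt adherence without the usual quality collapse, removing the need to hand-tune an optimal CFG score.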
Problem

Research questions and friction points this paper is trying to address.

Improving audio generation quality and naturalness via efficient diffusion transformers
Enhancing prompt adherence without fidelity loss using CFG rescaling
Boosting pretraining with synthetic captions from audio-understanding models and LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimized Diffusion Transformer for audio
Classifier-free guidance rescaling technique
Synthetic caption generation strategy (see the sketch after this list)
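The captioning strategy above amounts to bootstrapping text-audio pairs from unlabeled audio: an audio-language model drafts a caption for each clip, and an LLM can clean or enrich it before the pair enters pretraining. A minimal sketch is below; `caption_model.describe` and `llm.rewrite` are hypothetical stand-ins, not named components of EzAudio's pipeline.

```python
def build_synthetic_captions(audio_files, caption_model, llm):
    """Sketch of AI-assisted caption generation for T2A pretraining.

    caption_model and llm are hypothetical stand-ins: any audio-language
    model that maps audio -> text, and any LLM that polishes that text.
    """
    pairs = []
    for path in audio_files:
        draft = caption_model.describe(path)   # audio-language model drafts a caption
        caption = llm.rewrite(                 # LLM normalizes phrasing and detail
            f"Rewrite as a concise sound description: {draft}"
        )
        pairs.append({"audio": path, "caption": caption})
    return pairs
```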
👥 Authors
Jiarui Hai
Johns Hopkins University
computer audition, generative models, music information retrieval
Yong Xu
Tencent AI Lab, Bellevue, WA, USA
Hao Zhang
Tencent AI Lab, Bellevue, WA, USA
Chenxing Li
Tencent AI Lab, Bellevue, WA, USA
Helin Wang
Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA
Mounya Elhilali
Professor of Electrical and Computer Engineering, Johns Hopkins University
Dong Yu
Tencent AI Lab, Bellevue, WA, USA