Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning

📅 2025-01-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weak text–audio alignment, high model complexity, and the difficulty of balancing generation quality against efficiency in text-to-music (TTM) synthesis, this paper proposes a lightweight diffusion model. Methodologically: (1) it combines local semantic representations from T5 with global representations extracted from those same embeddings via mean or self-attention pooling, eliminating the need for an additional encoder; alternatively, it fuses CLAP's cross-modal global embeddings; (2) it applies the FiLM mechanism inside the diffusion UNet, enabling fine-grained modulation by the global text condition; (3) it injects the local text condition via cross-attention. Experiments show clear improvements: KL divergence drops to 1.47 (markedly better text–audio alignment), and Fréchet Audio Distance (FAD) reaches 1.89. The model also reduces the parameter count substantially, striking a favorable trade-off between generation fidelity and inference efficiency.
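The FiLM conditioning described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a global text embedding is projected to per-channel scale and shift parameters that modulate the UNet's feature maps. All dimensions and names (`FiLM`, `cond_dim`, `num_channels`) are hypothetical.

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a global conditioning vector produces
    per-channel scale (gamma) and shift (beta) applied to feature maps."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # One linear layer emits both gamma and beta.
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, time); cond: (batch, cond_dim)
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        # Broadcast over the time axis; (1 + gamma) keeps identity at init.
        return features * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)


# Example: modulate a (batch=2, channels=64, time=128) feature map
# with a 512-dimensional global text embedding (e.g., from CLAP).
film = FiLM(cond_dim=512, num_channels=64)
x = torch.randn(2, 64, 128)
g = torch.randn(2, 512)
print(film(x, g).shape)  # torch.Size([2, 64, 128])
```

In practice one such block would sit at each UNet resolution, while the local T5 token embeddings enter separately through cross-attention layers.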

Technology Category

Application Category

📝 Abstract
Diffusion based Text-To-Music (TTM) models generate music corresponding to text descriptions. Typically, UNet based diffusion models condition on text embeddings generated from a pre-trained large language model or from a cross-modality audio-language representation model. This work proposes a diffusion based TTM, in which the UNet is conditioned on both (i) a uni-modal language model (e.g., T5) via cross-attention and (ii) a cross-modal audio-language representation model (e.g., CLAP) via Feature-wise Linear Modulation (FiLM). The diffusion model is trained to exploit both a local text representation from the T5 and a global representation from the CLAP. Furthermore, we propose modifications that extract both global and local representations from the T5 through pooling mechanisms that we call mean pooling and self-attention pooling. This approach mitigates the need for an additional encoder (e.g., CLAP) to extract a global representation, thereby reducing the number of model parameters. Our results show that incorporating the CLAP global embeddings into the T5 local embeddings enhances text adherence (KL=1.47) compared to a baseline model solely relying on the T5 local embeddings (KL=1.54). Alternatively, extracting global text embeddings directly from the T5 local embeddings through the proposed mean pooling approach yields superior generation quality (FAD=1.89) while exhibiting marginally inferior text adherence (KL=1.51) against the model conditioned on both CLAP and T5 text embeddings (FAD=1.94 and KL=1.47). Our proposed solution is not only efficient but also compact in terms of the number of parameters required.
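The two pooling mechanisms named in the abstract can be sketched as below. This is a hedged illustration assuming T5 local embeddings of shape (batch, seq, dim) with a padding mask; the function and class names are hypothetical, not from the paper.

```python
import torch
import torch.nn as nn


def mean_pool(local_emb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean pooling: average token embeddings over non-padded positions."""
    m = mask.unsqueeze(-1).float()                     # (batch, seq, 1)
    return (local_emb * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-6)


class SelfAttentionPool(nn.Module):
    """Self-attention pooling: learn a scalar score per token, softmax the
    scores over the sequence, and take the weighted sum of embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, local_emb: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        logits = self.score(local_emb).squeeze(-1)     # (batch, seq)
        logits = logits.masked_fill(mask == 0, float("-inf"))
        weights = logits.softmax(dim=-1).unsqueeze(-1) # (batch, seq, 1)
        return (local_emb * weights).sum(dim=1)        # (batch, dim)


# Example: derive a global embedding from hypothetical T5 token embeddings.
emb = torch.randn(2, 5, 16)        # (batch, seq, dim)
mask = torch.ones(2, 5)            # no padding in this toy example
g_mean = mean_pool(emb, mask)      # shape (2, 16)
g_attn = SelfAttentionPool(16)(emb, mask)  # shape (2, 16)
```

Either global vector would then drive the FiLM modulation in place of CLAP, which is how the approach avoids the parameters of a second encoder.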
Problem

Research questions and friction points this paper is trying to address.

Text-to-Music Generation
Model Efficiency
Music Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-to-Music
Integrated Modeling
Efficient Parameterization
🔎 Similar Papers
No similar papers found.