Melody-Guided Music Generation

📅 2024-09-30
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses melody-guided text-to-music generation by proposing a controllable diffusion framework that jointly leverages implicit semantic alignment and explicit melody retrieval. Methodologically: (1) it introduces Contrastive Language–Music Pretraining (CLMP), the first approach to jointly align textual, audio, and melodic representations; (2) it designs a retrieval-augmented melody-conditioned diffusion mechanism, integrating melody-aligned embeddings with a lightweight text–audio joint encoder to achieve high-fidelity generation within a minimal architectural footprint. Experiments demonstrate that the model outperforms all existing open-source methods on MusicCaps and MusicBench—despite using fewer than one-third the parameters and less than 0.5% of the training data. Human evaluation confirms significant superiority across five dimensions—including realism and melody consistency—validating its dual strengths in content fidelity and harmonic coherence.
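The CLMP objective described above — jointly aligning text, audio, and melody representations — can be sketched as a CLIP-style symmetric InfoNCE loss averaged over the three modality pairs. This is a minimal illustrative sketch, not the paper's implementation: the function names, the uniform averaging over pairs, and the temperature value are assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    a, b: (N, D) arrays; row i of `a` is the positive match for row i of `b`.
    Temperature 0.07 is a common CLIP-style default, assumed here.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature           # (N, N); matched pairs on the diagonal
    idx = np.arange(len(a))

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_clmp_loss(text, audio, melody):
    # Hypothetical trimodal objective: average the pairwise contrastive
    # losses over text-audio, text-melody, and audio-melody.
    return (info_nce(text, audio)
            + info_nce(text, melody)
            + info_nce(audio, melody)) / 3.0
```

Under this sketch, perfectly aligned embeddings across the three modalities drive the loss toward zero, while mismatched embeddings keep it near log(batch size).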

📝 Abstract
We present the Melody-Guided Music Generation (MG2) model, a novel approach that uses melody to guide text-to-music generation and, despite its simple method and limited resources, achieves excellent performance. Specifically, we first align text with audio waveforms and their associated melodies using the newly proposed Contrastive Language-Music Pretraining, enabling the learned text representation to be fused with implicit melody information. Subsequently, we condition the retrieval-augmented diffusion module on both the text prompt and the retrieved melody. This allows MG2 to generate music that reflects the content of the given text description while keeping its intrinsic harmony under the guidance of explicit melody information. We conducted extensive experiments on two public datasets: MusicCaps and MusicBench. Surprisingly, the experimental results demonstrate that the proposed MG2 model surpasses current open-source text-to-music generation models, achieving this with fewer than 1/3 of the parameters or less than 1/200 of the training data of state-of-the-art counterparts. Furthermore, we conducted comprehensive human evaluations involving three types of users and five perspectives, using newly designed questionnaires to explore the potential real-world applications of MG2.
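The retrieval step of the retrieval-augmented diffusion module — finding an explicit melody to condition on, given a text prompt — can be sketched as a nearest-neighbor lookup in the shared CLMP embedding space. This is a hedged illustration under assumed interfaces: the function name, the cosine-similarity metric, and the precomputed melody bank are all assumptions, not the paper's actual code.

```python
import numpy as np

def retrieve_melody(text_emb, melody_bank, k=1):
    """Return the indices and similarities of the top-k melodies for a prompt.

    text_emb:    (D,) CLMP embedding of the text prompt (hypothetical).
    melody_bank: (M, D) precomputed CLMP embeddings of candidate melodies.
    """
    t = text_emb / np.linalg.norm(text_emb)
    bank = melody_bank / np.linalg.norm(melody_bank, axis=1, keepdims=True)
    sims = bank @ t                      # cosine similarity to each candidate
    top = np.argsort(-sims)[:k]          # indices of the k best matches
    return top, sims[top]
```

The retrieved melody embedding would then be passed, together with the text embedding, as the conditioning signal to the diffusion module.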
Problem

Research questions and friction points this paper is trying to address.

Melody Generation
Text-to-Music
Quality and Relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

MG2
melody-driven text-to-music generation
resource-efficient model
Shaopeng Wei
School of Business, Guangxi University, China
Manzhen Wei
School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, China
Haoyu Wang
School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, China
Yu Zhao
School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, China
Gang Kou
SWUFE (Southwestern University of Finance and Economics)
Multiple criteria decision making · Data mining · AHP · Group decision making · Opinion mining