🤖 AI Summary
This work addresses melody-guided text-to-music generation with a controllable diffusion framework that jointly leverages implicit semantic alignment and explicit melody retrieval. Methodologically, it (1) introduces Contrastive Language–Music Pretraining (CLMP), the first approach to jointly align textual, audio, and melodic representations, and (2) designs a retrieval-augmented, melody-conditioned diffusion mechanism that integrates melody-aligned embeddings with a lightweight text–audio joint encoder to achieve high-fidelity generation with a minimal architectural footprint. Experiments demonstrate that the model outperforms all existing open-source methods on MusicCaps and MusicBench, despite using fewer than one-third of the parameters and less than 0.5% of the training data. Human evaluation confirms significant superiority across five dimensions, including realism and melody consistency, validating its dual strengths in content fidelity and harmonic coherence.
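The summary does not spell out how CLMP aligns the three modalities. As a rough illustration only, a CLIP-style contrastive objective extended to three pairwise modality combinations might look like the sketch below; the function names, the symmetric InfoNCE formulation, and the temperature value are assumptions, not details from the paper.

```python
import numpy as np

def _normalize(x):
    # L2-normalize rows so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def _cross_entropy(logits, labels):
    # Mean cross-entropy of each row's softmax against its target index
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def clmp_loss(text, audio, melody, temperature=0.07):
    """Hypothetical three-way contrastive loss: symmetric InfoNCE over the
    text-audio, text-melody, and audio-melody pairs, where row i of each
    modality batch is the positive match for row i of the others."""
    labels = np.arange(len(text))
    total = 0.0
    for x, y in [(text, audio), (text, melody), (audio, melody)]:
        logits = _normalize(x) @ _normalize(y).T / temperature
        total += 0.5 * (_cross_entropy(logits, labels)
                        + _cross_entropy(logits.T, labels))
    return total / 3.0
```

Averaging the three pairwise losses is one plausible way to pull matched text, audio, and melody embeddings together in a single shared space, which is what lets the learned text representation carry implicit melody information.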
📝 Abstract
We present the Melody-Guided Music Generation (MG2) model, a novel approach that uses melody to guide text-to-music generation and, despite its simple design and limited resources, achieves excellent performance. Specifically, we first align text with audio waveforms and their associated melodies via the newly proposed Contrastive Language-Music Pretraining, so that the learned text representation is fused with implicit melody information. We then condition the retrieval-augmented diffusion module on both the text prompt and the retrieved melody. This allows MG2 to generate music that reflects the content of the given text description while preserving intrinsic harmony under the guidance of explicit melody information. We conducted extensive experiments on two public datasets, MusicCaps and MusicBench. Surprisingly, the experimental results demonstrate that the proposed MG2 model surpasses current open-source text-to-music generation models, achieving this with fewer than 1/3 of the parameters or less than 1/200 of the training data compared to state-of-the-art counterparts. Furthermore, we conducted comprehensive human evaluations involving three types of users and five perspectives, using newly designed questionnaires, to explore the potential real-world applications of MG2.
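The abstract describes conditioning the diffusion module on both the text prompt and a retrieved melody. A minimal sketch of that retrieval step, assuming text and melody embeddings live in a shared contrastive space, is shown below; `retrieve_melody`, `build_condition`, and the concatenation-based fusion are hypothetical names and choices for illustration, not the paper's actual interface.

```python
import numpy as np

def retrieve_melody(text_emb, melody_bank, k=1):
    # Return the k melody embeddings most similar (by cosine) to the text
    # embedding, assuming both were aligned by contrastive pretraining.
    t = text_emb / np.linalg.norm(text_emb)
    bank = melody_bank / np.linalg.norm(melody_bank, axis=1, keepdims=True)
    top = np.argsort(-(bank @ t))[:k]
    return melody_bank[top]

def build_condition(text_emb, melody_bank):
    # Hypothetical fusion: concatenate the text embedding with its nearest
    # retrieved melody embedding to form the diffusion conditioning vector.
    melody = retrieve_melody(text_emb, melody_bank, k=1)[0]
    return np.concatenate([text_emb, melody])
```

The point of the sketch is the division of labor the abstract describes: the text prompt carries the semantic content, while the retrieved melody supplies explicit harmonic guidance to the diffusion module.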