InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation

📅 2025-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of jointly ensuring coherence, fine-grained detail reconstruction, and training efficiency in high-fidelity, long-duration (up to 8 minutes), high-sample-rate (48 kHz) music generation, this paper introduces InspireMusic-1.5B-Long. Methodologically, it is the first to unify a Qwen-2.5 autoregressive Transformer with a super-resolution flow-matching model for controllable text-to-audio generation. A single-codebook, semantically rich audio tokenizer is designed to substantially reduce computational overhead. Furthermore, acoustic encoder-decoder architecture and end-to-end joint optimization are employed to preserve long-range structural consistency and high-frequency fidelity. Objective and subjective evaluations demonstrate that InspireMusic-1.5B-Long achieves audio quality and temporal coherence comparable to MusicGen and Stable Audio 2.0, while reducing training cost by approximately 35%. This work establishes a novel paradigm for efficient, high-quality, long-form music synthesis.

Technology Category

Application Category

📝 Abstract
We introduce InspireMusic, a framework integrated super resolution and large language model for high-fidelity long-form music generation. A unified framework generates high-fidelity music, songs, and audio, which incorporates an autoregressive transformer with a super-resolution flow-matching model. This framework enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to $8$ minutes. Then, an autoregressive transformer model based on Qwen 2.5 predicts audio tokens. Next, we employ a super-resolution flow-matching model to generate high-sampling rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model has a comparable performance to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
Problem

Research questions and friction points this paper is trying to address.

High-fidelity long-form music generation from text and audio prompts.
Integration of super-resolution and large language models for music synthesis.
Efficient training with a single audio tokenizer for rich semantic information.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines super-resolution and large language models
Uses single codebook audio tokenizer for efficiency
Generates high-fidelity music up to 8 minutes
🔎 Similar Papers
No similar papers found.
C
Chong Zhang
Tongyi Lab, Alibaba Group
Yukun Ma
Yukun Ma
Alibaba Group
ASRSLU
Q
Qian Chen
Tongyi Lab, Alibaba Group
W
Wen Wang
Tongyi Lab, Alibaba Group
Shengkui Zhao
Shengkui Zhao
Senior Algorithm Expert, Alibaba Group
Speech processing and large models
Z
Zexu Pan
Tongyi Lab, Alibaba Group
H
Hao Wang
Tongyi Lab, Alibaba Group
C
Chongjia Ni
Tongyi Lab, Alibaba Group
T
Trung Hieu Nguyen
Tongyi Lab, Alibaba Group
K
Kun Zhou
Tongyi Lab, Alibaba Group
Yidi Jiang
Yidi Jiang
Ph.D., National University of Singapore
MultimodalMachine LearningSpeech Processing
C
Chaohong Tan
Tongyi Lab, Alibaba Group
Z
Zhifu Gao
Tongyi Lab, Alibaba Group
Zhihao Du
Zhihao Du
Alibaba
Speech separationspeech enchancementspeaker diarization
B
Bin Ma
Tongyi Lab, Alibaba Group