Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing song generation methods lack zero-shot speaker cloning capabilities, while voice conversion approaches typically neglect the joint generation of vocals and accompaniment. To address these limitations, this work proposes UniSinger—the first end-to-end unified framework that simultaneously supports accompaniment-aware speaker-cloned singing voice generation and singing voice conversion. Built upon a multimodal diffusion Transformer, UniSinger constructs a unified speaker embedding space and introduces task-specific modality masking alongside a curriculum learning strategy to harmonize multi-task optimization and mitigate task interference. The model achieves state-of-the-art performance on both tasks, enabling for the first time cross-task timbre control and mutual performance gains, thereby advancing the frontier of intelligent music generation.
📝 Abstract
While song generation and singing voice conversion (SVC) have advanced significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and SVC with accompaniment co-generation. Building on a multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy that uses task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that UniSinger achieves state-of-the-art performance on both tasks and that the two tasks benefit each other, offering new possibilities for intelligent music production.
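The abstract names task-specific modality masking and a curriculum over tasks but gives no implementation details. The following is only a minimal sketch of how such a scheme could be wired up, not the authors' method: the modality names, the mask table, the stage boundaries, and the task order are all hypothetical choices made for illustration.

```python
import random

# ASSUMPTION: which conditioning streams each task may see. These names and
# this table are illustrative only; the paper's actual inputs may differ.
TASK_MASKS = {
    "song_generation": {"semantic_content": True, "speaker_embedding": True, "source_vocal": False},
    "svc": {"semantic_content": False, "speaker_embedding": True, "source_vocal": True},
}

# ASSUMPTION: a three-stage curriculum given as (end_step, task sampling weights).
# end_step=None marks the final, open-ended stage. The ordering is a guess.
CURRICULUM = [
    (1000, {"svc": 1.0, "song_generation": 0.0}),
    (2000, {"svc": 0.0, "song_generation": 1.0}),
    (None, {"svc": 0.5, "song_generation": 0.5}),
]


def task_weights(step):
    """Return the task sampling weights of the curriculum stage containing `step`."""
    for end_step, weights in CURRICULUM:
        if end_step is None or step < end_step:
            return weights
    raise ValueError("curriculum must end with an open-ended stage")


def sample_task(step, rng):
    """Draw one task name according to the current stage's weights."""
    weights = task_weights(step)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]


def apply_mask(task, conditions):
    """Replace every conditioning stream hidden for `task` with None."""
    mask = TASK_MASKS[task]
    return {name: (value if mask[name] else None) for name, value in conditions.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    batch = {"semantic_content": "lyrics", "speaker_embedding": [0.1, 0.2], "source_vocal": "wav"}
    for step in (0, 1500, 5000):
        task = sample_task(step, rng)
        print(step, task, apply_mask(task, batch))
```

In a real training loop, the masked streams would more likely be replaced by a learned null embedding or a zero tensor than by None; None is used here only to keep the sketch dependency-free.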
Problem

Research questions and friction points this paper is trying to address.

song generation
singing voice conversion
speaker cloning
accompaniment co-generation
vocal-accompaniment synergy
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified song generation
singing voice conversion
accompaniment co-generation
multimodal diffusion transformer
curriculum learning
Ziyu Zhang
Northwestern Polytechnical University, China
Chunyu Qiang
Kuaishou Technology; TJU; CASIA
Speech Synthesis
Xiaopeng Wang
Institute of Automation, Chinese Academy of Sciences
Fake Audio Detection, Text To Speech, Speech Large Model
Yuxin Guo
Institute of Automation, Chinese Academy of Sciences
MLLM, Multi-modal Learning, Audio-Visual Learning
Kang Yin
University of Science and Technology of China, China
Wenjie Tian
Northwest Polytechnical University
speech generation
Jingbin Hu
Northwestern Polytechnical University, China
Tianlun Zuo
Northwestern Polytechnical University, China
Zhao Guo
Northwestern Polytechnical University, China
Teng Ma
Kuaishou Technology, China
Yuzhe Liang
Shanghai Jiao Tong University
Deep learning, Multimodal Learning
Chen Zhang
Kuaishou Technology, China
Lei Xie
Northwestern Polytechnical University
speech processing, speech recognition, speech synthesis, multimedia, artificial intelligence