Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing song generation methods lack zero-shot speaker cloning, while singing voice conversion (SVC) approaches typically neglect the joint generation of vocals and accompaniment. To address these limitations, this work proposes UniSinger, described as the first end-to-end framework that unifies speaker-cloning song generation with SVC that co-generates accompaniment. Built on a multimodal diffusion Transformer, UniSinger constructs a unified speaker embedding space that transfers speaker representations from SVC to song generation, giving fine-grained cross-task timbre control. To reduce interference between the two tasks during optimization, it combines task-specific modality masking with a curriculum learning strategy. The model is reported to achieve state-of-the-art performance on both tasks, with each task benefiting the other.

📝 Abstract

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and SVC with accompaniment co-generation. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy that uses task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that UniSinger achieves state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.
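The abstract does not specify how modality masking and the curriculum are implemented, so the following is only a minimal sketch of the general idea: a staged schedule that unmasks one modality at a time. The modality names, the stage boundaries, the order of stages, and the helper names `visible_modalities` and `apply_mask` are all assumptions made for illustration, not details from the paper. Per-task masks are left out because the abstract gives no information about them.

```python
# Hypothetical modality names, loosely following the abstract's wording
# ("semantic content, vocal timbre, and accompaniment").
MODALITIES = ("semantic", "timbre", "accompaniment")

# Hypothetical curriculum: (first training step of the stage, visible modalities).
# The real stages, their order, and their lengths are not given in the abstract.
CURRICULUM = [
    (0, ("semantic",)),
    (1000, ("semantic", "timbre")),
    (2000, ("semantic", "timbre", "accompaniment")),
]


def visible_modalities(step):
    """Return the modalities left unmasked at a given training step."""
    active = CURRICULUM[0][1]
    for start, mods in CURRICULUM:
        if step >= start:
            active = mods
    return active


def apply_mask(features, visible):
    """Zero out every modality that is not listed in `visible`."""
    return {
        name: list(vec) if name in visible else [0.0] * len(vec)
        for name, vec in features.items()
    }


# Toy feature vectors standing in for real conditioning tensors.
features = {
    "semantic": [1.0, 2.0],
    "timbre": [3.0],
    "accompaniment": [4.0, 5.0],
}
masked = apply_mask(features, visible_modalities(1500))
```

At step 1500 in this toy schedule the accompaniment stream is zeroed while the semantic and timbre streams pass through unchanged. A real system would more likely use learned mask tokens or attention masks than plain zeros.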

Problem

Research questions and friction points this paper is trying to address.

song generation

singing voice conversion

speaker cloning

accompaniment co-generation

vocal-accompaniment synergy

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified song generation

singing voice conversion

accompaniment co-generation