🤖 AI Summary
Existing song generation methods lack zero-shot speaker cloning, while singing voice conversion (SVC) approaches typically neglect the joint generation of vocals and accompaniment. To address both limitations, this work proposes UniSinger, the first end-to-end unified framework that supports both accompaniment-aware, speaker-cloned singing voice generation and singing voice conversion. Built on a multimodal diffusion Transformer, UniSinger constructs a unified speaker embedding space and combines task-specific modality masking with a curriculum learning strategy to harmonize multi-task optimization and mitigate task interference. The model achieves state-of-the-art performance on both tasks, enables cross-task timbre control, and shows that the two tasks improve each other, opening new possibilities for intelligent music generation.
📝 Abstract
While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and SVC with accompaniment co-generation. Building on a multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy that uses task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that UniSinger achieves state-of-the-art performance on both tasks and that the two tasks benefit each other, offering new possibilities for intelligent music production.
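The abstract gives no implementation details for the curriculum with task-specific modality masking, so the following is only an illustrative sketch of the general idea, not the authors' method. The stage names, the easy-to-hard ordering, the equal-length unlocking schedule, and the helper functions `unlocked_stages`, `mask_modalities`, and `sample_training_input` are all assumptions made for this example; only the three modality names (semantic content, vocal timbre, accompaniment) come from the text.

```python
import random

# Hypothetical curriculum stages, ordered from easy to hard.
# Each stage lists the modalities left visible; the rest are masked.
STAGES = [
    ("content_only", {"content"}),
    ("content_and_timbre", {"content", "timbre"}),
    ("all_modalities", {"content", "timbre", "accompaniment"}),
]


def unlocked_stages(step, total_steps):
    """Return the stages available at `step` (assumed schedule:
    one additional stage unlocks after each third of training)."""
    step = max(step, 0)
    count = min(len(STAGES), step * len(STAGES) // total_steps + 1)
    return STAGES[:count]


def mask_modalities(features, visible):
    """Zero out every modality not in `visible`, keeping vector lengths."""
    return {
        name: (list(vec) if name in visible else [0.0] * len(vec))
        for name, vec in features.items()
    }


def sample_training_input(features, step, total_steps, rng=random):
    """Pick one unlocked stage at random and apply its modality mask."""
    name, visible = rng.choice(unlocked_stages(step, total_steps))
    return name, mask_modalities(features, visible)
```

Under this assumed schedule, early steps only ever see the content modality, while later steps mix all three stages, so earlier masking patterns keep being revisited as harder ones are introduced.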