🤖 AI Summary
Singing Voice Synthesis (SVS) requires precise modeling of pitch, duration, and phonetic articulation; however, existing diffusion-based approaches often introduce audible artifacts that degrade naturalness. To address this, we propose an end-to-end waveform-space conditional diffusion model, eliminating the distortion-prone two-stage architecture. Our key contributions are: (1) a reference-guided dual-branch U-Net, incorporating a parallel low-frequency upsampling branch to enhance acoustic detail reconstruction; and (2) replacing the reference audio with degraded ground-truth audio during training to mitigate temporal misalignment, thereby significantly improving pitch contour fidelity and the capture of long-range spectral dependencies. Evaluated on the Opencpop dataset, our method achieves state-of-the-art performance in both objective metrics (e.g., MCD, F0 RMSE) and subjective listening tests (MOS), demonstrating fewer artifacts, enhanced naturalness, and greater expressiveness. Ablation studies confirm the effectiveness of each proposed component.
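As a reference for the objective metrics mentioned above, a minimal sketch of Mel-Cepstral Distortion (MCD) in its standard form is shown below. It assumes two already time-aligned mel-cepstral sequences (the paper does not specify its alignment procedure, so the inputs here are a simplifying assumption), and follows the conventional definition that excludes the 0th (energy) coefficient.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Frame-averaged MCD in dB between two aligned mel-cepstral
    sequences of shape (frames, dims). The 0th coefficient (frame
    energy) is conventionally excluded from the distance."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    # Standard MCD formula: (10 / ln 10) * sqrt(2 * sum of squared diffs)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical sequences give 0 dB by construction.
mc = np.random.randn(100, 25)
assert mel_cepstral_distortion(mc, mc) == 0.0
```

Lower MCD indicates closer spectral envelopes between synthesized and reference audio; F0 RMSE complements it by measuring pitch-contour error directly.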
📝 Abstract
Singing voice synthesis (SVS) aims to generate expressive and high-quality vocals from musical scores, requiring precise modeling of pitch, duration, and articulation. While diffusion-based models have achieved remarkable success in image and video generation, their application to SVS remains challenging due to the complex acoustic and musical characteristics of singing, often resulting in artifacts that degrade naturalness. In this work, we propose SmoothSinger, a conditional diffusion model designed to synthesize high-quality, natural singing voices. Unlike prior methods that depend on vocoders as a final stage and often introduce distortion, SmoothSinger refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. The model adopts a reference-guided dual-branch architecture, using low-quality audio from any baseline system as a reference to guide the denoising process, enabling more expressive and context-aware synthesis. Furthermore, it enhances the conventional U-Net with a parallel low-frequency upsampling path, allowing the model to better capture pitch contours and long-term spectral dependencies. To improve alignment during training, we replace the reference audio with degraded ground-truth audio, addressing the temporal mismatch between reference and target signals. Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results in both objective and subjective evaluations. Extensive ablation studies confirm its effectiveness in reducing artifacts and improving the naturalness of synthesized voices.
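To illustrate the training-time substitution described above, the sketch below produces a degraded copy of the ground-truth waveform to serve as the reference signal. The abstract does not specify the degradation used, so the low-pass filtering and additive noise here (and the `cutoff_hz` and `noise_std` parameters) are hypothetical stand-ins; the point the sketch demonstrates is that a degraded copy of the target is sample-aligned with it by construction, avoiding the temporal mismatch of an external baseline's output.

```python
import numpy as np

def make_training_reference(gt_wave, sr=24000, cutoff_hz=3000.0,
                            noise_std=0.01, seed=0):
    """Hypothetical degradation of ground-truth audio: crude low-pass
    filtering via FFT masking plus additive Gaussian noise. Because the
    reference is derived from the target itself, the two signals are
    perfectly time-aligned."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(gt_wave)
    freqs = np.fft.rfftfreq(len(gt_wave), d=1.0 / sr)
    spec[freqs > cutoff_hz] = 0.0  # discard high-frequency detail
    degraded = np.fft.irfft(spec, n=len(gt_wave))
    return degraded + noise_std * rng.standard_normal(len(gt_wave))

# One second of a 440 Hz sine as a stand-in ground-truth waveform.
t = np.arange(24000) / 24000.0
gt = np.sin(2 * np.pi * 440.0 * t)
ref = make_training_reference(gt)
assert ref.shape == gt.shape  # same length, aligned sample-for-sample
```

At inference time, this aligned stand-in is swapped for actual baseline-system output, which is the mismatch the training scheme is designed to bridge.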