🤖 AI Summary
Singing Voice Synthesis (SVS) requires precise modeling of pitch, duration, and phonetic articulation; however, existing diffusion-based approaches often introduce audible artifacts that degrade naturalness. To address this, we propose an end-to-end waveform-space conditional diffusion model, eliminating the distortion-prone two-stage architecture. Our key contributions are: (1) a reference-guided dual-branch U-Net, incorporating a parallel low-frequency upsampling branch to enhance acoustic detail reconstruction; and (2) replacing the reference audio with degraded ground-truth audio during training to mitigate temporal misalignment, thereby significantly improving pitch contour fidelity and the capture of long-range spectral dependencies. Evaluated on the Opencpop dataset, our method achieves state-of-the-art performance in both objective metrics (e.g., MCD, F0 RMSE) and subjective listening tests (MOS), demonstrating fewer artifacts, enhanced naturalness, and greater expressiveness. Ablation studies confirm the effectiveness of each proposed component.
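As a reference for the objective metrics mentioned above, a minimal sketch of Mel-Cepstral Distortion (MCD) in its standard form is shown below. It assumes two already time-aligned mel-cepstral sequences (the paper does not specify its alignment procedure, so the inputs here are a simplifying assumption), and follows the conventional definition that excludes the 0th (energy) coefficient.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Frame-averaged MCD in dB between two aligned mel-cepstral
    sequences of shape (frames, dims). The 0th coefficient (frame
    energy) is conventionally excluded from the distance."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    # Standard MCD formula: (10 / ln 10) * sqrt(2 * sum of squared diffs)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

# Identical sequences give 0 dB by construction.
mc = np.random.randn(100, 25)
assert mel_cepstral_distortion(mc, mc) == 0.0
```

Lower MCD indicates closer spectral envelopes between synthesized and reference audio; F0 RMSE complements it by measuring pitch-contour error directly.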
📝 Abstract
Singing voice synthesis (SVS) aims to generate expressive and high-quality vocals from musical scores, requiring precise modeling of pitch, duration, and articulation. While diffusion-based models have achieved remarkable success in image and video generation, their application to SVS remains challenging due to the complex acoustic and musical characteristics of singing, often resulting in artifacts that degrade naturalness. In this work, we propose SmoothSinger, a conditional diffusion model designed to synthesize high-quality, natural singing voices. Unlike prior methods that depend on vocoders as a final stage and often introduce distortion, SmoothSinger refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. The model adopts a reference-guided dual-branch architecture, using low-quality audio from any baseline system as a reference to guide the denoising process, enabling more expressive and context-aware synthesis. Furthermore, it enhances the conventional U-Net with a parallel low-frequency upsampling path, allowing the model to better capture pitch contours and long-term spectral dependencies. To improve alignment during training, we replace the reference audio with degraded ground-truth audio, addressing the temporal mismatch between reference and target signals. Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results in both objective and subjective evaluations. Extensive ablation studies confirm its effectiveness in reducing artifacts and improving the naturalness of synthesized voices.
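To illustrate the training-time substitution described above, the sketch below produces a degraded copy of the ground-truth waveform to serve as the reference signal. The abstract does not specify the degradation used, so the low-pass filtering and additive noise here (and the `cutoff_hz` and `noise_std` parameters) are hypothetical stand-ins; the point the sketch demonstrates is that a degraded copy of the target is sample-aligned with it by construction, avoiding the temporal mismatch of an external baseline's output.

```python
import numpy as np

def make_training_reference(gt_wave, sr=24000, cutoff_hz=3000.0,
                            noise_std=0.01, seed=0):
    """Hypothetical degradation of ground-truth audio: crude low-pass
    filtering via FFT masking plus additive Gaussian noise. Because the
    reference is derived from the target itself, the two signals are
    perfectly time-aligned."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(gt_wave)
    freqs = np.fft.rfftfreq(len(gt_wave), d=1.0 / sr)
    spec[freqs > cutoff_hz] = 0.0  # discard high-frequency detail
    degraded = np.fft.irfft(spec, n=len(gt_wave))
    return degraded + noise_std * rng.standard_normal(len(gt_wave))

# One second of a 440 Hz sine as a stand-in ground-truth waveform.
t = np.arange(24000) / 24000.0
gt = np.sin(2 * np.pi * 440.0 * t)
ref = make_training_reference(gt)
assert ref.shape == gt.shape  # same length, aligned sample-for-sample
```

At inference time, this aligned stand-in is swapped for actual baseline-system output, which is the mismatch the training scheme is designed to bridge.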