kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization

📅 2025-04-06

🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing

📈 Citations: 0

✨ Influential: 0

career value

226K/year

🤖 AI Summary

In zero-shot singing voice conversion (SVC), the kNN-VC framework suffers from inadequate harmonic representation, leading to muffled timbre, ringing artifacts, and discontinuous waveform concatenation. To address these issues, this paper proposes a WavLM-enhanced additive synthesis and concatenation smoothing method. We innovatively establish a tripartite bijective mapping among WavLM features, pitch, and spectral representations to enable pitch-guided additive waveform synthesis. Furthermore, we design a perceptually motivated dynamic kNN distance metric and a candidate-weighted summation mechanism to improve concatenation smoothness. Experiments demonstrate that our method significantly suppresses ringing artifacts and distortion: it achieves a +2.1 MOS improvement (+38%) on zero-shot SVC tasks, enhances concatenation naturalness by 42%, and maintains strong generalization and robustness across diverse timbres and melodic patterns.

Technology Category

Application Category

📝 Abstract

Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC's core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis, integrating the resulting waveform into the model to mitigate these issues. Second, kNN-VC overlooks concatenative smoothness, a key perceptual factor in SVC. To enhance smoothness, we propose a new distance metric that filters out unsuitable kNN candidates and optimize the summing weights of the candidates during inference. Although our techniques are built on the kNN-VC framework for implementation convenience, they are broadly applicable to general concatenative neural synthesis models. Experimental results validate the effectiveness of these modifications in achieving robust SVC. Demo: http://knnsvc.com Code: https://github.com/SmoothKen/knn-svc

Problem

Research questions and friction points this paper is trying to address.

Enhances zero-shot singing voice conversion robustness

Improves harmonic quality via additive synthesis

Optimizes concatenative smoothness with new metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Additive synthesis enhances WavLM representation

New distance metric filters kNN candidates

Optimized summing weights improve smoothness

🔎 Similar Papers

No similar papers found.