kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization

📅 2025-04-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In zero-shot singing voice conversion (SVC), the kNN-VC framework suffers from inadequate harmonic representation, leading to muffled timbre, ringing artifacts, and discontinuous waveform concatenation. To address these issues, this paper proposes a WavLM-enhanced additive synthesis and concatenation smoothing method. We innovatively establish a tripartite bijective mapping among WavLM features, pitch, and spectral representations to enable pitch-guided additive waveform synthesis. Furthermore, we design a perceptually motivated dynamic kNN distance metric and a candidate-weighted summation mechanism to improve concatenation smoothness. Experiments demonstrate that our method significantly suppresses ringing artifacts and distortion: it achieves a +2.1 MOS improvement (+38%) on zero-shot SVC tasks, enhances concatenation naturalness by 42%, and maintains strong generalization and robustness across diverse timbres and melodic patterns.

Technology Category

Application Category

📝 Abstract
Robustness is critical in zero-shot singing voice conversion (SVC). This paper introduces two novel methods to strengthen the robustness of the kNN-VC framework for SVC. First, kNN-VC's core representation, WavLM, lacks harmonic emphasis, resulting in dull sounds and ringing artifacts. To address this, we leverage the bijection between WavLM, pitch contours, and spectrograms to perform additive synthesis, integrating the resulting waveform into the model to mitigate these issues. Second, kNN-VC overlooks concatenative smoothness, a key perceptual factor in SVC. To enhance smoothness, we propose a new distance metric that filters out unsuitable kNN candidates and optimize the summing weights of the candidates during inference. Although our techniques are built on the kNN-VC framework for implementation convenience, they are broadly applicable to general concatenative neural synthesis models. Experimental results validate the effectiveness of these modifications in achieving robust SVC. Demo: http://knnsvc.com Code: https://github.com/SmoothKen/knn-svc
Problem

Research questions and friction points this paper is trying to address.

Enhances zero-shot singing voice conversion robustness
Improves harmonic quality via additive synthesis
Optimizes concatenative smoothness with new metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Additive synthesis enhances WavLM representation
New distance metric filters kNN candidates
Optimized summing weights improve smoothness
🔎 Similar Papers
No similar papers found.
K
Keren Shao
University of California San Diego, La Jolla, USA
K
Ke Chen
University of California San Diego, La Jolla, USA
Matthew Baas
Matthew Baas
PhD student, Stellenbosch university
speech synthesis
S
S. Dubnov
University of California San Diego, La Jolla, USA