Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing pitch modification methods for Mel-spectrogram-driven neural vocoders either rely on fundamental frequency (F0) estimation or require model fine-tuning. To address these limitations, we propose a training-free, model-agnostic pitch shifting framework. Our approach first maps Mel spectrograms to a pseudo-cepstral domain by applying the pseudo-inverse Mel transform followed by a discrete cosine transform (DCT), then shifts harmonic peaks in the cepstral domain, and finally reconstructs the target Mel spectrogram through the inverse discrete cosine transform (IDCT) and the Mel filterbank. This is the first method enabling universal, F0-estimation-free, and retraining-free pitch modification, implicitly modeling harmonic structure while overcoming accuracy and generalization bottlenecks inherent in conventional time-frequency domain approaches. Extensive evaluation on state-of-the-art vocoders, including HiFi-GAN and WaveGrad, demonstrates significant improvements in F0 root-mean-square error (RMSE) and Mel cepstral distortion (MCD) over PSOLA and WORLD, alongside marked gains in subjective Mean Opinion Score (MOS).

📝 Abstract
This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, the method is compatible with any mel-based vocoder without requiring additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform, then converted to the cepstrum by applying the DCT. In this domain, the cepstral peak is shifted without having to estimate its position, and the modified mel-spectrogram is recomputed by applying the IDCT and the mel filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders, as well as in comparison with traditional pitch modification methods.
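The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`mel_filterbank`, `dct_matrix`, `warp_quefrency`, `shift_pitch_mel`), the triangular filterbank, the `cutoff` that separates envelope from excitation, and the choice to move the peak by linearly resampling the high-quefrency region are all hypothetical, since the abstract does not specify how the shift is carried out.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filterbank, shape (n_mels, n_fft // 2 + 1). Hypothetical helper."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose acts as the IDCT."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * k * (i + 0.5) / n)
    d[0] /= np.sqrt(2.0)
    return d

def warp_quefrency(cep, ratio, cutoff=20):
    """Assumed shift rule: out[k] = cep[ratio * k] for k >= cutoff.

    A peak at index k0 moves to k0 / ratio, so ratio > 1 raises the pitch.
    Coefficients below `cutoff` (spectral envelope) are left untouched.
    """
    n = cep.shape[0]
    q = np.arange(n, dtype=float)
    out = cep.copy()
    src = q[cutoff:] * ratio
    for t in range(cep.shape[1]):
        # Positions past the last coefficient are filled with zeros.
        out[cutoff:, t] = np.interp(src, q, cep[:, t], right=0.0)
    return out

def shift_pitch_mel(mel, fb, ratio, cutoff=20, eps=1e-5):
    """mel: (n_mels, frames) linear-magnitude mel spectrogram."""
    # 1) Approximate the linear magnitude spectrum with the pseudo-inverse.
    mag = np.maximum(np.linalg.pinv(fb) @ mel, eps)
    # 2) Pseudo-cepstrum: DCT of the log magnitude along frequency.
    d = dct_matrix(mag.shape[0])
    cep = d @ np.log(mag)
    # 3) Move the harmonic peak without estimating where it is.
    cep_shifted = warp_quefrency(cep, ratio, cutoff)
    # 4) IDCT, undo the log, and re-apply the mel filterbank.
    return fb @ np.exp(d.T @ cep_shifted)

if __name__ == "__main__":
    fb = mel_filterbank(sr=16000, n_fft=512, n_mels=80)
    rng = np.random.default_rng(0)
    mel = fb @ rng.uniform(0.1, 1.0, size=(257, 10))
    print(shift_pitch_mel(mel, fb, ratio=1.2).shape)
```

With this DCT convention, a harmonic series with fundamental f0 produces a ripple in the log spectrum that peaks near index sr / f0, so a 200 Hz voice at 16 kHz would sit around index 80. The output is again a mel spectrogram of the same shape, so it could be passed to a mel-based vocoder as-is; for `ratio < 1` this simple resampling reads from below `cutoff`, which a more careful version would need to handle.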
Problem

Research questions and friction points this paper is trying to address.

Pitch modification for mel-based neural vocoders
Compatible with any mel-based vocoder without retraining
Shifts harmonic structure via cepstrum feature space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cepstrum-based pitch modification for mel-spectrograms
Compatible with any mel-based vocoder without retraining
Direct harmonic structure shift via cepstral peak manipulation
Nikolaos Ellinas
Innoetics, Samsung Electronics, Greece
Alexandra Vioni
Innoetics, Samsung Electronics, Greece
Panos Kakoulidis
Samsung Electronics, National and Kapodistrian University of Athens
Machine Learning, Bioinformatics, Cheminformatics, Structural Biology, Human-Computer Interaction
Georgios Vamvoukakis
Innoetics, Samsung Electronics, Greece
Myrsini Christidou
Innoetics, Samsung Electronics, Greece
Konstantinos Markopoulos
Innoetics, Samsung Electronics, Greece
Junkwang Oh
Mobile eXperience Business, Samsung Electronics, Republic of Korea
Gunu Jho
Mobile eXperience Business, Samsung Electronics, Republic of Korea
Inchul Hwang
Samsung Electronics
Artificial Intelligence
Aimilios Chalamandaris
Samsung Electronics | innoetics
Text to Speech, TTS, Speech Processing, Speech Recognition, Expressive TTS
Pirros Tsiakoulis
Innoetics
Speech Synthesis