FCPE: A Fast Context-based Pitch Estimation Model

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the significant degradation of pitch estimation (PE) for monophonic audio under noisy conditions, this paper proposes FCPE, a lightweight, context-based PE model aimed at MIDI transcription and singing voice conversion (SVC). The model adopts a Lynx-Net architecture with depth-wise separable convolutions, which capture mel spectrogram features at low computational cost while remaining robust to noise. On the MIR-1K dataset, FCPE achieves a Raw Pitch Accuracy (RPA) of 96.79%, on par with state-of-the-art methods, and a Real-Time Factor (RTF) of 0.0062 (about 161× faster than real time) on a single RTX 4090 GPU, which is significantly more efficient than existing algorithms. The main contribution is combining Lynx-Net with depth-wise separable convolutions to obtain accurate, noise-robust pitch estimation at very low computational cost.

📝 Abstract
Pitch estimation (PE) in monophonic audio is crucial for MIDI transcription and singing voice conversion (SVC), but existing methods suffer significant performance degradation under noise. In this paper, we propose FCPE, a fast context-based pitch estimation model that employs a Lynx-Net architecture with depth-wise separable convolutions to effectively capture mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with the state-of-the-art methods. The Real-Time Factor (RTF) is 0.0062 on a single RTX 4090 GPU, which significantly outperforms existing algorithms in efficiency. Code is available at https://github.com/CNChTu/FCPE.
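The two metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes the common convention that RPA is the fraction of voiced reference frames whose estimate lies within 50 cents of the reference, and that RTF is processing time divided by audio duration. The paper's exact evaluation protocol is not given here, so the tolerance, the function names, and the toy data below are illustrative assumptions, not the authors' code.

```python
import math


def cents(f_hz, ref_hz=10.0):
    """Convert a frequency in Hz to cents relative to ref_hz."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def raw_pitch_accuracy(ref_f0, est_f0, tol_cents=50.0):
    """Fraction of voiced reference frames estimated within tol_cents.

    Simplified convention (an assumption): frames with ref_f0 <= 0 are
    unvoiced and skipped; an estimate <= 0 on a voiced frame is a miss.
    """
    voiced = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0]
    if not voiced:
        return 0.0
    hits = sum(
        1 for r, e in voiced
        if e > 0 and abs(cents(e) - cents(r)) <= tol_cents
    )
    return hits / len(voiced)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means the model runs faster than real time."""
    return processing_seconds / audio_seconds


# Toy frame-level F0 tracks in Hz (0.0 marks an unvoiced frame).
ref = [0.0, 220.0, 220.0, 440.0, 440.0]
est = [0.0, 221.0, 240.0, 438.0, 0.0]
rpa = raw_pitch_accuracy(ref, est)

# Hypothetical timing: 0.062 s of compute for 10 s of audio.
rtf = real_time_factor(0.062, 10.0)
speedup = 1.0 / rtf

print(f"RPA = {rpa:.2%}, RTF = {rtf:.4f}, speedup = {speedup:.0f}x")
```

With an RTF of 0.0062, one second of audio takes about 6.2 ms to process, which corresponds to roughly 161 times faster than real time.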
Problem

Research questions and friction points this paper is trying to address.

Robust monophonic pitch estimation under noisy conditions
Fast computational performance for real-time applications
Accurate MIDI transcription and singing voice conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lynx-Net architecture with depth-wise separable convolutions
Extracts mel spectrogram features efficiently
Maintains low computational cost and noise robustness
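The cost saving from depth-wise separable convolutions can be sketched with a simple parameter count. The layer sizes below (256 channels, kernel size 31) are hypothetical and are not taken from the paper. The sketch only compares a standard 1D convolution with its depth-wise separable factorization, and says nothing about the actual Lynx-Net layer layout.

```python
def standard_conv1d_params(c_in, c_out, kernel):
    """Weights in a standard 1D convolution (bias omitted)."""
    return kernel * c_in * c_out


def depthwise_separable_conv1d_params(c_in, c_out, kernel):
    """Weights in a depth-wise conv followed by a 1x1 point-wise conv."""
    depthwise = kernel * c_in  # one kernel-tap filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, kernel = 256, 256, 31

standard = standard_conv1d_params(c_in, c_out, kernel)
separable = depthwise_separable_conv1d_params(c_in, c_out, kernel)
ratio = separable / standard  # equals 1/c_out + 1/kernel

print(f"standard:  {standard:,} weights")
print(f"separable: {separable:,} weights")
print(f"ratio:     {ratio:.4f}")
```

For these sizes the separable layer needs under 4% of the weights of the standard layer. The per-frame multiply-accumulate count shrinks by the same factor, which is the general reason this factorization is a common choice for fast models.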
Yuxin Luo
Fish Audio, Santa Clara, CA, USA
Ruoyi Zhang
Fish Audio, Santa Clara, CA, USA
Lu-Chuan Liu
University of Science and Technology of China, Hefei, Anhui, China
Tianyu Li
Fish Audio, Santa Clara, CA, USA
Hangyu Liu
Beijing University of Posts and Telecommunications