PitchFlower: A flow-based neural audio codec with pitch controllability

📅 2025-10-29

🤖 AI Summary

This paper targets two weaknesses of neural audio codecs: imprecise pitch control and strong coupling between pitch and other acoustic features. It proposes a flow-based codec that receives the true F0 as explicit conditioning while the F0 contour of the training input is flattened and randomly shifted. A vector-quantized bottleneck prevents the model from recovering the original pitch from the perturbed input, and a flow-based decoder generates the audio. In experiments, the model controls pitch more accurately than WORLD at much higher audio quality, and it is more controllable than SiFiGAN at comparable quality. The framework is also presented as a simple, extensible path toward disentangling other speech attributes, such as duration and loudness, for fine-grained, attribute-specific manipulation.

📝 Abstract

We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high-quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
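To make the perturbation concrete, here is a minimal sketch of flattening and randomly shifting an F0 contour. It is an illustration under stated assumptions, not the authors' implementation: the function name `flatten_and_shift_f0`, the use of the geometric mean as the flat pitch, the uniform shift range of ±6 semitones, and the convention that 0 Hz marks unvoiced frames are all hypothetical choices. The sketch also works only on the contour; resynthesizing audio with the perturbed pitch is not shown.

```python
import numpy as np

def flatten_and_shift_f0(f0_hz, max_shift_semitones=6.0, rng=None):
    """Flatten voiced frames to one pitch, then apply a random shift.

    Assumptions (not from the paper): 0 Hz marks unvoiced frames, the flat
    pitch is the geometric mean of the voiced frames, and the shift is drawn
    uniformly in semitones.
    """
    rng = np.random.default_rng() if rng is None else rng
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    out = np.zeros_like(f0_hz)
    if not voiced.any():
        return out, 0.0
    # Flatten: average in log-frequency, i.e. the geometric mean in Hz.
    flat_hz = np.exp(np.log(f0_hz[voiced]).mean())
    # Shift: one random offset for the whole contour, in semitones.
    shift = rng.uniform(-max_shift_semitones, max_shift_semitones)
    out[voiced] = flat_hz * 2.0 ** (shift / 12.0)
    return out, shift

# The true contour would still be passed to the model as conditioning.
true_f0 = np.array([0.0, 110.0, 220.0, 0.0, 440.0])
perturbed_f0, shift = flatten_and_shift_f0(true_f0, rng=np.random.default_rng(0))
```

After this step the perturbed contour carries no information about the original melody, so a model trained to reconstruct the original audio has to take its pitch from the conditioning signal.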

Problem

Research questions and friction points this paper is trying to address.

Developing a neural audio codec with explicit pitch controllability

Enhancing audio quality while maintaining precise pitch manipulation

Providing an extensible framework for disentangling multiple speech attributes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-based neural audio codec with pitch control

F0 perturbation and conditioning for disentanglement

Vector-quantization bottleneck that prevents pitch recovery
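The bottleneck idea in the list above can be sketched as a nearest-neighbor codebook lookup. This is generic vector quantization, not the paper's configuration: the codebook size, latent dimension, and the function name `vector_quantize` are assumptions for illustration, and training details such as codebook updates or gradient handling are omitted.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Replace each latent frame with its nearest codebook entry (Euclidean)."""
    # Pairwise squared distances, shape (num_frames, num_codes).
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return codebook[indices], indices

# Toy example: three 2-D codes and three latent frames (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])
quantized, indices = vector_quantize(latents, codebook)
```

Because each frame is reduced to a single index from a finite codebook, the latent has limited capacity, which is the property the source relies on to keep residual pitch information from leaking through.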

🔎 Similar Papers

FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates