Efficient Neural and Numerical Methods for High-Quality Online Speech Spectrogram Inversion via Gradient Theorem

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Phase reconstruction in online speech spectrogram inversion incurs high computational overhead, hindering real-time deployment. Method: This paper proposes a two-stage decoupled framework: (i) a lightweight CNN (8k parameters) predicts phase derivatives from magnitudes; (ii) phases are recovered from those derivatives by a least-squares integration grounded in the gradient theorem. To improve efficiency, accepting one extra hop of streaming latency halves the neural inference cost, and a linear-complexity solver exploits the tridiagonality and positive semi-definiteness of the system matrix. Contribution/Results: Combining deep learning, the gradient theorem, and structured numerical optimization, the method achieves high-fidelity waveform reconstruction (STOI/ESTOI ≈ 0.95) while reducing neural inference cost by 30× and accelerating the least-squares solve by 3–4 orders of magnitude, enabling, for the first time, high-quality, low-latency, fully differentiable online spectrogram inversion.
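As a toy illustration of the gradient-theorem idea behind step (ii), not the paper's actual 2-D least-squares formulation, integrating a (here: numerically computed) phase derivative along the time axis recovers the phase up to a constant; all variable names below are illustrative:

```python
import numpy as np

# The gradient theorem: integrating a derivative along a path recovers
# the original function up to an additive constant.
t = np.linspace(0.0, 1.0, 1001)
dt = t[1] - t[0]
phase = np.sin(2 * np.pi * 3 * t)        # stand-in for a "true" phase track
dphase = np.gradient(phase, dt)          # derivative (as a network might predict)

# Cumulative (Riemann) path integral of the derivative, anchored at phase[0]:
recovered = phase[0] + np.cumsum(dphase) * dt
# "recovered" tracks "phase" up to small discretization error.
```

In the paper, this integration is posed jointly over time and frequency derivatives as a least-squares problem, which is what gives rise to the structured system matrix discussed below.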

📝 Abstract
Recent work in online speech spectrogram inversion effectively combines deep learning with the gradient theorem to predict phase derivatives directly from magnitudes. Phases are then estimated from their derivatives via least squares, resulting in a high-quality reconstruction. In this work, we introduce three innovations that drastically reduce computational cost while maintaining high quality. Firstly, we introduce a novel neural network architecture with just 8k parameters, 30 times smaller than the previous state of the art. Secondly, increasing latency by one hop size allows us to further halve the cost of the neural inference step. Thirdly, we observe that the least-squares problem features a tridiagonal matrix and propose a linear-complexity solver for the least-squares step that leverages tridiagonality and positive semi-definiteness, achieving a speedup of several orders of magnitude. We release samples online.
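The paper's solver is not reproduced here, but the classic Thomas algorithm shows why tridiagonality admits an O(n) solve instead of the O(n³) of a general dense solver. This is a sketch assuming a well-conditioned (e.g. symmetric positive-definite) system; the function name and interface are illustrative:

```python
import numpy as np

def solve_tridiagonal(lower, diag, upper, b):
    """Thomas algorithm: O(n) solve of a tridiagonal system Ax = b.

    lower: sub-diagonal (length n-1), diag: main diagonal (length n),
    upper: super-diagonal (length n-1), b: right-hand side (length n).
    """
    n = len(diag)
    c = np.empty(n - 1)  # modified super-diagonal
    d = np.empty(n)      # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = b[0] / diag[0]
    # Forward elimination
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (b[i] - lower[i - 1] * d[i - 1]) / denom
    # Back substitution
    x = np.empty(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Example: A = tridiag(-1, 2, -1) is symmetric positive definite;
# with b = [1, 0, 0, 1] the exact solution is x = [1, 1, 1, 1].
x = solve_tridiagonal(np.array([-1.0, -1.0, -1.0]),
                      np.array([2.0, 2.0, 2.0, 2.0]),
                      np.array([-1.0, -1.0, -1.0]),
                      np.array([1.0, 0.0, 0.0, 1.0]))
```

In practice, banded LAPACK routines (e.g. via `scipy.linalg.solveh_banded` for the symmetric positive-definite case) implement the same linear-time idea with better numerical robustness.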
Problem

Research questions and friction points this paper is trying to address.

Develop efficient neural methods for speech spectrogram inversion
Reduce computational cost while maintaining reconstruction quality
Solve least squares problem with linear-complexity tridiagonal solver
Innovation

Methods, ideas, or system contributions that make the work stand out.

8k-parameter neural network for efficiency
Increased latency reduces neural inference cost
Linear-complexity solver for tridiagonal least squares
Andres Fernandez
PhD Candidate, University of Tübingen
Machine Learning · Signal Processing · Optimization
Juan Azcarreta
Reality Labs Research, Meta, UK
Cagdas Bilen
Reality Labs Research, Meta, UK
Jesus Monge Alvarez
Reality Labs Research, Meta, Spain