🤖 AI Summary
Phase reconstruction in online speech spectrogram inversion incurs high computational overhead, hindering real-time deployment. Method: This paper proposes a two-stage decoupled framework: (i) a lightweight CNN (8K parameters) predicts phase derivatives; (ii) phases are then recovered by least-squares integration of these derivatives, grounded in the gradient theorem. To further cut cost, allowing one extra hop of streaming latency halves the neural inference delay, and a linear-complexity solver exploits the tridiagonality and positive semi-definiteness of the system matrix. Contribution/Results: Integrating deep learning, differential-geometric constraints, and structured numerical optimization, the method achieves high-fidelity waveform reconstruction (STOI/ESTOI ≈ 0.95) while reducing neural inference cost by 30× and accelerating the least-squares solve by 3–4 orders of magnitude—enabling, for the first time, high-quality, low-latency, fully differentiable online spectrogram inversion.
📝 Abstract
Recent work in online speech spectrogram inversion effectively combines deep learning with the gradient theorem to predict phase derivatives directly from magnitudes. Phases are then estimated from their derivatives via least squares, resulting in a high-quality reconstruction. In this work, we introduce three innovations that drastically reduce computational cost while maintaining high quality. Firstly, we introduce a novel neural network architecture with just 8k parameters, 30 times smaller than the previous state of the art. Secondly, increasing latency by one hop size allows us to further halve the cost of the neural inference step. Thirdly, we observe that the least-squares problem features a tridiagonal matrix and propose a linear-complexity solver that leverages its tridiagonality and positive semi-definiteness, achieving a speedup of several orders of magnitude. We release samples online.
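To illustrate why tridiagonality enables a linear-complexity solve, here is a minimal sketch of the classic Thomas algorithm, which solves a tridiagonal system in O(n) instead of the O(n³) of a dense solver. This is a generic illustration of the structural idea, not the paper's exact solver; the function name and interface are our own, and the example system is a made-up symmetric positive-definite matrix, not one arising from phase integration.

```python
import numpy as np

def solve_tridiagonal(a, b, c, d):
    """Solve T x = d for tridiagonal T in O(n) via the Thomas algorithm.

    a : sub-diagonal   (length n-1)
    b : main diagonal  (length n)
    c : super-diagonal (length n-1)
    d : right-hand side (length n)

    Illustrative sketch only; assumes T is well-conditioned (e.g. the
    positive semi-definite normal-equation matrices mentioned above),
    so no pivoting is performed.
    """
    n = len(b)
    cp = np.empty(n)  # modified super-diagonal
    dp = np.empty(n)  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        m = b[i] - a[i - 1] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    # Back substitution.
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

In practice one would call an optimized banded routine (e.g. SciPy's `solveh_banded`), but the sketch makes the cost argument concrete: one forward sweep and one back substitution, each touching every row once.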