🤖 AI Summary
Phase reconstruction in online speech spectrogram inversion incurs high computational overhead, hindering real-time deployment. Method: This paper proposes a two-stage decoupled framework: (i) a lightweight CNN (8K parameters) predicts phase derivatives; (ii) phases are then recovered by least-squares integration of these derivatives, grounded in the gradient theorem. To further cut cost, allowing one extra hop of streaming latency halves the neural inference delay, and a linear-complexity solver exploits the tridiagonality and positive semi-definiteness of the system matrix. Contribution/Results: Integrating deep learning, differential-geometric constraints, and structured numerical optimization, the method achieves high-fidelity waveform reconstruction (STOI/ESTOI ≈ 0.95) while reducing neural inference cost by 30× and accelerating the least-squares solve by 3–4 orders of magnitude—enabling, for the first time, high-quality, low-latency, fully differentiable online spectrogram inversion.
📝 Abstract
Recent work in online speech spectrogram inversion effectively combines deep learning with the gradient theorem to predict phase derivatives directly from magnitudes. Phases are then estimated from their derivatives via least squares, resulting in a high-quality reconstruction. In this work, we introduce three innovations that drastically reduce computational cost while maintaining high quality. Firstly, we introduce a novel neural network architecture with just 8k parameters, 30 times smaller than the previous state of the art. Secondly, increasing latency by one hop size allows us to further halve the cost of the neural inference step. Thirdly, we observe that the least-squares problem features a tridiagonal matrix and propose a linear-complexity solver that leverages its tridiagonality and positive semi-definiteness, achieving a speedup of several orders of magnitude. We release samples online.
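To illustrate why tridiagonality enables a linear-complexity solve, here is a minimal sketch of the classic Thomas algorithm, which solves a tridiagonal system in O(n) instead of the O(n³) of a dense solver. This is a generic illustration of the structural idea, not the paper's exact solver; the function name and interface are our own, and the example system is a made-up symmetric positive-definite matrix, not one arising from phase integration.

```python
import numpy as np

def solve_tridiagonal(a, b, c, d):
    """Solve T x = d for tridiagonal T in O(n) via the Thomas algorithm.

    a : sub-diagonal   (length n-1)
    b : main diagonal  (length n)
    c : super-diagonal (length n-1)
    d : right-hand side (length n)

    Illustrative sketch only; assumes T is well-conditioned (e.g. the
    positive semi-definite normal-equation matrices mentioned above),
    so no pivoting is performed.
    """
    n = len(b)
    cp = np.empty(n)  # modified super-diagonal
    dp = np.empty(n)  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        m = b[i] - a[i - 1] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    # Back substitution.
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

In practice one would call an optimized banded routine (e.g. SciPy's `solveh_banded`), but the sketch makes the cost argument concrete: one forward sweep and one back substitution, each touching every row once.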