Neural Speech Separation with Parallel Amplitude and Phase Spectrum Estimation

📅 2025-09-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Most existing speech separation methods neglect explicit phase spectrum modeling, leading to incomplete time-frequency reconstruction and limited fidelity. To address this, we propose a novel neural separation model that jointly estimates magnitude and phase spectra in parallel within an end-to-end framework, the first to explicitly co-model both components. Our architecture integrates deep feature fusion with a time-frequency Transformer to capture long-range temporal and spectral dependencies, while a dual-branch parallel network separately optimizes magnitude and phase prediction. This design avoids the error accumulation inherent in conventional implicit phase recovery or post-hoc phase estimation. Evaluated on standard benchmarks (WSJ0-2mix, Libri2Mix), our method significantly outperforms state-of-the-art time-domain and implicit-phase approaches, achieving higher SI-SNR improvement (SI-SNRi), enhanced speech intelligibility, and superior generalization and robustness.

๐Ÿ“ Abstract
This paper proposes APSS, a novel neural speech separation model with parallel amplitude and phase spectrum estimation. Unlike most existing speech separation methods, APSS explicitly estimates the phase spectrum for more complete and accurate separation. Specifically, APSS first extracts the amplitude and phase spectra from the mixed speech signal. The extracted amplitude and phase spectra are then fused by a feature combiner into joint representations, which are further processed by a deep processor with time-frequency Transformers to capture temporal and spectral dependencies. Finally, leveraging parallel amplitude and phase separators, APSS estimates the respective spectra for each speaker from the resulting features, which are then combined via the inverse short-time Fourier transform (iSTFT) to reconstruct the separated speech signals. Experimental results indicate that APSS surpasses both time-domain separation methods and implicit-phase-estimation-based time-frequency approaches. APSS also achieves stable and competitive results on multiple datasets, highlighting its strong generalization capability and practical applicability.
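The pipeline described in the abstract (STFT feature extraction → feature combiner → time-frequency Transformer processor → parallel amplitude/phase separators → per-speaker iSTFT) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: all layer sizes, the single-axis `TransformerEncoder` (the paper attends over both time and frequency), and every module name here are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the APSS pipeline; hyperparameters are illustrative.
N_FFT, HOP, DIM, N_SPK = 512, 128, 64, 2
N_BINS = N_FFT // 2 + 1

class APSSSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature combiner: fuse amplitude and phase into a joint embedding
        self.combiner = nn.Linear(2 * N_BINS, DIM)
        # Deep processor: Transformer over time frames (the paper also models
        # the frequency axis; one axis is shown here for brevity)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.processor = nn.TransformerEncoder(layer, num_layers=2)
        # Parallel separators: one head per spectrum type, all speakers at once
        self.amp_head = nn.Linear(DIM, N_SPK * N_BINS)
        self.phase_head = nn.Linear(DIM, N_SPK * N_BINS)

    def forward(self, mix):  # mix: (batch, samples)
        win = torch.hann_window(N_FFT)
        spec = torch.stft(mix, N_FFT, HOP, window=win,
                          return_complex=True)                  # (B, F, T)
        amp, phase = spec.abs(), spec.angle()
        feats = torch.cat([amp, phase], dim=1).transpose(1, 2)  # (B, T, 2F)
        h = self.processor(self.combiner(feats))                # (B, T, D)
        B, T, _ = h.shape
        amps = self.amp_head(h).view(B, T, N_SPK, N_BINS)
        phases = self.phase_head(h).view(B, T, N_SPK, N_BINS)
        # Recombine each speaker's amplitude and phase, then invert via iSTFT
        est = amps * torch.exp(1j * phases)
        est = est.permute(0, 2, 3, 1).reshape(B * N_SPK, N_BINS, T)
        wav = torch.istft(est, N_FFT, HOP, window=win, length=mix.shape[-1])
        return wav.view(B, N_SPK, -1)

mix = torch.randn(1, 16000)
out = APSSSketch()(mix)
print(out.shape)  # torch.Size([1, 2, 16000])
```

The key structural point is that amplitude and phase flow through separate output heads over a shared representation, rather than the phase being inherited from the mixture or recovered after the fact.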
Problem

Research questions and friction points this paper is trying to address.

Explicitly estimates phase spectrum for accurate speech separation
Fuses amplitude and phase spectra into joint representations
Reconstructs separated speech signals via parallel spectrum estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel amplitude and phase spectrum estimation
Feature fusion with time-frequency Transformers processing
Inverse short-time Fourier transform signal reconstruction
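As a small motivating check for the last bullet: amplitude and phase together form a lossless representation of the signal, so if both are estimated well per speaker, the iSTFT reconstructs the waveform essentially exactly. The round trip below is generic STFT arithmetic, not code from the paper; all values are arbitrary.

```python
import torch

# STFT -> (amplitude, phase) -> complex spectrum -> iSTFT round trip.
n_fft, hop = 512, 128
win = torch.hann_window(n_fft)
x = torch.randn(16000)

spec = torch.stft(x, n_fft, hop, window=win, return_complex=True)
amp, phase = spec.abs(), spec.angle()

# Recombine amplitude and phase, then invert back to the waveform
rec = torch.istft(amp * torch.exp(1j * phase), n_fft, hop,
                  window=win, length=x.numel())
print(torch.allclose(x, rec, atol=1e-4))  # True
```

Implicit-phase methods break this round trip by reusing the mixture's phase, which is exactly the fidelity gap the parallel estimation is meant to close.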
Fei Liu
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China
Yang Ai
Associate Researcher, University of Science and Technology of China
Speech Synthesis · Speech Enhancement · Speech Coding · Deep Learning
Zhen-Hua Ling
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China