AI Summary
Existing array-geometry-agnostic speech separation methods either incur high computational cost or make poor use of spatial information. To address this, the paper proposes UniArray, a geometry-agnostic framework for unified spectral-spatial modeling. It combines virtual microphone estimation (VME), frequency-bin-level spectral feature extraction, and spatial dictionary learning (SDL) with a hierarchical dual-path separator that models dependencies along the time and frequency axes and abandons the conventional interleaving of intra- and inter-channel processing. VME keeps performance robust across arrays with varying channel counts. Experiments show that UniArray outperforms state-of-the-art methods on SI-SDR improvement (SI-SDRi), wideband PESQ (WB-PESQ), narrowband PESQ (NB-PESQ), and short-time objective intelligibility (STOI) under both seen and unseen array geometries.
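For readers unfamiliar with the first metric listed above, SI-SDRi is the gain in scale-invariant signal-to-distortion ratio of the separated estimate over the unprocessed mixture. The following is a minimal NumPy sketch of the standard definition, not the paper's evaluation code; the function names, the `eps` stabilizer, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the input mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Synthetic example (assumed data): white noise stands in for speech.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = reference + interference           # unprocessed input
estimate = reference + 0.1 * interference    # interference attenuated tenfold
sdr_gain = si_sdr_improvement(estimate, mixture, reference)
```

Because the target is obtained by projection, rescaling the estimate leaves the score essentially unchanged, which is what makes the metric scale-invariant.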
Abstract
Array-geometry-agnostic speech separation (AGA-SS) aims to develop an effective separation method regardless of the microphone array geometry. Conventional methods rely on permutation-free operations, such as summation or attention mechanisms, to capture spatial information. However, these approaches often incur high computational costs or disrupt the effective use of spatial information during intra- and inter-channel interactions, leading to suboptimal performance. To address these issues, we propose UniArray, a novel approach that abandons the conventional interleaving manner. UniArray consists of three key components: a virtual microphone estimation (VME) module, a feature extraction and fusion module, and a hierarchical dual-path separator. The VME ensures robust performance across arrays with varying channel numbers. The feature extraction and fusion module leverages a spectral feature extraction module and a spatial dictionary learning (SDL) module to extract and fuse frequency-bin-level features, allowing the separator to focus on using the fused features. The hierarchical dual-path separator models feature dependencies along the time and frequency axes while maintaining computational efficiency. Experimental results show that UniArray outperforms state-of-the-art methods in SI-SDRi, WB-PESQ, NB-PESQ, and STOI across both seen and unseen array geometries.
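The permutation-free operations that the abstract attributes to conventional methods can be illustrated with channel pooling. The sketch below shows only that baseline idea, not any UniArray module; the function name `pool_channels`, the tensor layout, and the random features are assumptions, and averaging is used here as a normalized form of summation.

```python
import numpy as np


def pool_channels(features):
    """Average per-channel features over the channel axis.

    features: array of shape (channels, frames, dims).
    Returns an array of shape (frames, dims), whatever the channel count.
    """
    return features.mean(axis=0)


# Hypothetical per-microphone features for two arrays of different sizes.
rng = np.random.default_rng(0)
four_mic = rng.standard_normal((4, 100, 16))
six_mic = rng.standard_normal((6, 100, 16))

# Reordering the microphones does not change the pooled result.
shuffled = four_mic[rng.permutation(4)]
same_after_shuffle = np.allclose(pool_channels(four_mic), pool_channels(shuffled))

# The output shape is independent of the number of microphones.
shape_four = pool_channels(four_mic).shape
shape_six = pool_channels(six_mic).shape
```

Pooling of this kind accepts any channel count and ordering, but it also collapses the differences between channels, which is one way to picture the loss of spatial information that the abstract describes.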