UniArray: Unified Spectral-Spatial Modeling for Array-Geometry-Agnostic Speech Separation

📅 2025-03-07
🤖 AI Summary
To address the high computational cost and limited use of spatial information in existing array-geometry-agnostic speech separation methods, this paper proposes the first geometry-agnostic unified spectral-spatial modeling framework. It combines virtual microphone estimation (VME), band-level spectral feature extraction, and spatial dictionary learning (SDL) with a hierarchical dual-path convolutional separation network that eliminates permutation-invariant training constraints and supports arbitrary channel counts. The core contribution is achieving geometry independence together with efficient spatiotemporal modeling. Experiments report state-of-the-art results on all four metrics, namely SI-SDR improvement (SI-SDRi), wideband PESQ (WB-PESQ), narrowband PESQ (NB-PESQ), and short-time objective intelligibility (STOI), under both seen and unseen array geometries.
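
For readers unfamiliar with the headline metric, the following is a minimal sketch of how SI-SDR and SI-SDRi are commonly computed. It follows the standard definition (project the estimate onto the reference, then compare target energy to residual energy); it is not taken from the paper's evaluation code. The function names `si_sdr` and `si_sdr_improvement` and the `eps` stabilizer are illustrative assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length 1-D signals.

    Assumption: signals are plain sequences of floats; eps only guards
    against division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    # Remove the DC offset from both signals, as is commonly done.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything in the estimate that is not the scaled target is residual.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    # Synthetic example: a target tone plus an interfering tone.
    reference = [math.sin(0.1 * n) for n in range(200)]
    interference = [math.sin(0.37 * n + 1.0) for n in range(200)]
    mixture = [r + i for r, i in zip(reference, interference)]
    # A hypothetical separator output that suppresses the interferer by 20 dB.
    estimate = [r + 0.1 * i for r, i in zip(reference, interference)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, reference):.2f} dB")
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.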

๐Ÿ“ Abstract
Array-geometry-agnostic speech separation (AGA-SS) aims to develop an effective separation method regardless of the microphone array geometry. Conventional methods rely on permutation-free operations, such as summation or attention mechanisms, to capture spatial information. However, these approaches often incur high computational costs or disrupt the effective use of spatial information during intra- and inter-channel interactions, leading to suboptimal performance. To address these issues, we propose UniArray, a novel approach that abandons the conventional interleaving manner. UniArray consists of three key components: a virtual microphone estimation (VME) module, a feature extraction and fusion module, and a hierarchical dual-path separator. The VME ensures robust performance across arrays with varying channel numbers. The feature extraction and fusion module leverages a spectral feature extraction module and a spatial dictionary learning (SDL) module to extract and fuse frequency-bin-level features, allowing the separator to focus on using the fused features. The hierarchical dual-path separator models feature dependencies along the time and frequency axes while maintaining computational efficiency. Experimental results show that UniArray outperforms state-of-the-art methods in SI-SDRi, WB-PESQ, NB-PESQ, and STOI across both seen and unseen array geometries.
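
The "permutation-free operations" that the abstract attributes to conventional methods can be illustrated with a small sketch. This is not UniArray's method and not any specific baseline's implementation; it only shows, under the assumption that each microphone channel has already been encoded as a fixed-length feature vector, why averaging across channels is independent of channel order and channel count. The function name `mean_pool_channels` is hypothetical.

```python
import math


def mean_pool_channels(features):
    """Average per-channel feature vectors into one vector.

    `features` is a list of channels, each a list of floats of equal length.
    The result does not depend on channel order or on how many channels
    there are, which is what makes the operation geometry-agnostic.
    """
    if not features:
        raise ValueError("at least one channel is required")
    dim = len(features[0])
    if any(len(channel) != dim for channel in features):
        raise ValueError("all channels must have the same feature length")
    count = len(features)
    return [sum(channel[d] for channel in features) / count for d in range(dim)]


if __name__ == "__main__":
    three_mics = [[0.2, 1.0, -0.5], [0.4, 0.8, -0.1], [0.0, 1.2, -0.3]]
    reordered = [three_mics[2], three_mics[0], three_mics[1]]

    pooled = mean_pool_channels(three_mics)
    pooled_reordered = mean_pool_channels(reordered)

    # Same output regardless of microphone ordering.
    print(all(math.isclose(a, b) for a, b in zip(pooled, pooled_reordered)))

    # The same function accepts a different channel count unchanged.
    print(len(mean_pool_channels(three_mics[:2])))
```

The sketch also hints at the trade-off the abstract raises: once channels are averaged, the pooled vector no longer records which microphone contributed what, so some spatial detail is no longer available to later layers.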
Problem

Research questions and friction points this paper is trying to address.

Develops an array-geometry-agnostic speech separation method
Addresses the high computational cost of capturing spatial information
Improves performance across varying microphone array geometries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual microphone estimation for array robustness
Spectral-spatial feature fusion via dictionary learning
Hierarchical dual-path separator for efficient modeling
Weiguang Chen
College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
Junjie Zhang
College of Computer Science and Electronic Engineering, Hunan University, Hunan, China
Jielong Yang
Nanyang Technological University
statistical machine learning
E. Chng
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Xionghu Zhong
College of Computer Science and Electronic Engineering, Hunan University, Hunan, China