UniArray: Unified Spectral-Spatial Modeling for Array-Geometry-Agnostic Speech Separation

📅 2025-03-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To reduce the computational cost and the loss of spatial information that arise when speech separation must work across different microphone array geometries, this paper proposes UniArray, a geometry-agnostic framework for unified spectral-spatial modeling. It combines virtual microphone estimation (VME), frequency-bin-level spectral feature extraction fused with spatial dictionary learning (SDL), and a hierarchical dual-path separator that models dependencies along the time and frequency axes. The design abandons the interleaved intra- and inter-channel processing of earlier methods, and VME keeps performance robust as the number of channels varies. Experiments show that UniArray outperforms state-of-the-art methods in SI-SDR improvement (SI-SDRi), wideband PESQ (WB-PESQ), narrowband PESQ (NB-PESQ), and short-time objective intelligibility (STOI) on both seen and unseen array geometries.

📝 Abstract

Array-geometry-agnostic speech separation (AGA-SS) aims to develop an effective separation method regardless of the microphone array geometry. Conventional methods rely on permutation-free operations, such as summation or attention mechanisms, to capture spatial information. However, these approaches often incur high computational costs or disrupt the effective use of spatial information during intra- and inter-channel interactions, leading to suboptimal performance. To address these issues, we propose UniArray, a novel approach that abandons the conventional interleaving manner. UniArray consists of three key components: a virtual microphone estimation (VME) module, a feature extraction and fusion module, and a hierarchical dual-path separator. The VME ensures robust performance across arrays with varying channel numbers. The feature extraction and fusion module leverages a spectral feature extraction module and a spatial dictionary learning (SDL) module to extract and fuse frequency-bin-level features, allowing the separator to focus on using the fused features. The hierarchical dual-path separator models feature dependencies along the time and frequency axes while maintaining computational efficiency. Experimental results show that UniArray outperforms state-of-the-art methods in SI-SDRi, WB-PESQ, NB-PESQ, and STOI across both seen and unseen array geometries.
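The headline metric, SI-SDRi, is the scale-invariant signal-to-distortion ratio of the separated signal minus that of the unprocessed mixture, both measured against the clean reference. The sketch below uses the standard SI-SDR definition; the function names, the synthetic signals, and the `eps` stabilizer are illustrative assumptions and are not taken from the paper's evaluation code.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (standard definition)."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Synthetic example (assumed data): white noise stands in for two speakers.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = reference + interferer
estimate = reference + 0.1 * interferer  # a hypothetical, imperfect separation
gain_db = si_sdr_improvement(estimate, mixture, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.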

Problem

Research questions and friction points this paper is trying to address.

Develops array-geometry-agnostic speech separation method

Addresses high computational costs in spatial information capture

Improves performance across varying microphone array geometries
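The permutation-free operations that the abstract attributes to conventional methods can be illustrated with channel summation: the result does not depend on microphone order and is defined for any channel count. This is a minimal sketch of that conventional idea only, not of UniArray; the tensor layout `(channels, frames, frequency bins)` and the function name are assumptions.

```python
import numpy as np

def pool_channels(features):
    """Sum features over the channel axis (axis 0), discarding channel order."""
    return features.sum(axis=0)

rng = np.random.default_rng(0)
# Assumed layout: (channels, frames, frequency bins).
four_mic = rng.standard_normal((4, 100, 257))
shuffled = four_mic[rng.permutation(4)]        # same array, microphones reordered
six_mic = rng.standard_normal((6, 100, 257))   # a different channel count

pooled_a = pool_channels(four_mic)
pooled_b = pool_channels(shuffled)
pooled_c = pool_channels(six_mic)
```

Here `pooled_a` and `pooled_b` agree up to floating-point rounding, and `pooled_c` has the same shape as both even though it came from six channels. The pooled result also no longer records which microphone contributed what, which is one way such operations can lose spatial information.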

Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual microphone estimation for array robustness

Spectral-spatial feature fusion via dictionary learning

Hierarchical dual-path separator for efficient modeling
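The dual-path idea, modeling dependencies along the frequency axis and then along the time axis instead of over the full time-frequency grid at once, can be sketched as alternating per-axis passes. In the sketch below a fixed smoothing kernel stands in for the learned layers; the kernel, the residual connections, and all names are placeholders and do not reproduce the paper's separator.

```python
import numpy as np

def smooth_along(features, axis, kernel):
    """Apply a 1-D convolution independently to every vector along `axis`."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, features
    )

def dual_path_block(features, kernel):
    """One block: a frequency-axis pass, then a time-axis pass, each residual.

    features: array of shape (frames, frequency bins).
    """
    features = features + smooth_along(features, axis=1, kernel=kernel)  # within each frame
    features = features + smooth_along(features, axis=0, kernel=kernel)  # within each bin
    return features

kernel = np.ones(3) / 3.0      # placeholder for a learned per-axis layer
impulse = np.zeros((10, 12))   # (frames, frequency bins)
impulse[5, 5] = 1.0
out = dual_path_block(impulse, kernel)
```

After one block the single impulse has spread to its neighbours in both time and frequency, although each pass only ever looked along one axis. Each pass costs time proportional to the number of time-frequency cells times the kernel length, which is the usual efficiency argument for this factorization.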
