🤖 AI Summary
This work addresses the insufficient modeling of target time-frequency (TF) units in single-channel real-time speech enhancement. We propose HDF-Net, a two-level hierarchical deep filtering framework. Methodologically, it introduces a novel time-frequency decoupled two-stage deep filtering mechanism; incorporates a lightweight TAConv module to enhance local TF feature extraction; and employs a hierarchical network architecture to jointly model target TF bins and their contextual neighborhoods. Compared with state-of-the-art methods, HDF-Net achieves significant improvements in DNSMOS, STOI, and PESQ scores, yielding superior speech quality and intelligibility. Moreover, it reduces model parameters by 32% and computational cost by 27%, achieving a favorable trade-off between performance and latency, which makes it well suited for edge-device deployment in real-time applications.
📝 Abstract
This paper proposes a model that integrates sub-band processing and deep filtering to fully exploit information from the target time-frequency (TF) bin and its surrounding TF bins for single-channel speech enhancement. The sub-band module captures surrounding frequency-bin information at the input, while the deep filtering module applies filtering at the output to both the target TF bin and its surrounding TF bins. To further improve model performance, we decouple deep filtering into temporal and frequency components and introduce a two-stage framework, reducing the complexity of filter-coefficient prediction at each stage. Additionally, we propose the TAConv module to strengthen convolutional feature extraction. Experimental results demonstrate that the proposed hierarchical deep filtering network (HDF-Net) effectively utilizes surrounding TF bin information and outperforms other advanced systems while using fewer resources.
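To make the decoupling idea concrete, the sketch below illustrates generic deep filtering split into a causal temporal stage and a symmetric frequency stage, applied to a complex spectrogram. This is a minimal illustration of the general mechanism, not the paper's implementation: the function names, filter orders (`T` past frames, `±F` neighboring bins), and the assumption that the network has already predicted per-bin complex coefficients `Ht` and `Hf` are all ours.

```python
import numpy as np

def temporal_filter(X, Ht, T):
    """Stage 1: causal filtering along time.
    X:  complex spectrogram, shape (frames, bins)
    Ht: per-bin complex coefficients, shape (frames, bins, T+1),
        where index T corresponds to the current frame."""
    frames, bins = X.shape
    Xp = np.pad(X, ((T, 0), (0, 0)))  # pad only past frames (causal)
    Y = np.zeros_like(X)
    for dt in range(T + 1):
        # dt = 0 is frame t-T, dt = T is the current frame t
        Y += Ht[:, :, dt] * Xp[dt:dt + frames, :]
    return Y

def frequency_filter(X, Hf, F):
    """Stage 2: filtering across neighboring frequency bins.
    Hf: per-bin complex coefficients, shape (frames, bins, 2F+1),
        where index F corresponds to the target bin."""
    frames, bins = X.shape
    Xp = np.pad(X, ((0, 0), (F, F)))  # symmetric pad in frequency
    Y = np.zeros_like(X)
    for df in range(2 * F + 1):
        Y += Hf[:, :, df] * Xp[:, df:df + bins]
    return Y

# Two-stage application: the product of a (T+1)-tap temporal filter and a
# (2F+1)-tap frequency filter replaces one joint (T+1)x(2F+1) filter,
# so each stage predicts far fewer coefficients per TF bin.
```

Each stage predicts `T+1` or `2F+1` coefficients per bin instead of `(T+1)(2F+1)` for a joint TF filter, which is the complexity reduction the abstract refers to.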