🤖 AI Summary
This work addresses the challenging problem of low-latency single-channel speech enhancement in large-volume reverberant environments (e.g., conference rooms, theaters), characterized by far-field acquisition (5–10 m), high room volume (>1000 m³), and long reverberation times (T60 > 1 s). To tackle this, we first systematically demonstrate the feasibility of far-field single-channel speech enhancement. We propose an early-reflection-aware reverberation modeling and suppression strategy, departing from conventional full-reverberation suppression paradigms. A physics-informed random impulse response (RIR) simulation method is designed to explicitly model the coupled dependence of T60 on room volume—critical for realistic training data generation. Furthermore, we develop a lightweight, real-time deep time-frequency masking network. Experiments show substantial improvements: +2.1 in PESQ, +18.3% in STOI, and end-to-end latency <40 ms—achieving significant gains in speech intelligibility and naturalness while meeting strict real-time constraints.
📝 Abstract
Dereverberation is an important sub-task of Speech Enhancement (SE) to improve the signal's intelligibility and quality. However, it remains challenging because the reverberation is highly correlated with the signal. Furthermore, the single-channel SE literature has predominantly focused on rooms with short reverb times (typically under 1 second), smaller rooms (under volumes of 1000 cubic meters) and relatively short distances (up to 2 meters). In this paper, we explore real-time low-latency single-channel SE under distant microphone scenarios, such as 5 to 10 meters, and focus on conference rooms and theatres, with larger room dimensions and reverberation times. Such a setup is useful for applications such as lecture demonstrations, drama, and to enhance stage acoustics. First, we show that single-channel SE in such challenging scenarios is feasible. Second, we investigate the relationship between room volume and reverberation time, and demonstrate its importance when randomly simulating room impulse responses. Lastly, we show that for dereverberation with short decay times, preserving early reflections before decaying the transfer function of the room improves overall signal quality.