🤖 AI Summary
To address low accuracy and poor real-time performance of offline and streaming automatic speech recognition (ASR) in air traffic control (ATC) scenarios, this paper proposes a domain-specific self-supervised pretraining paradigm. Leveraging 4.5 thousand hours of unlabeled ATC speech, we construct and train BEST-RQ, a dedicated ATC speech encoder. We further design a low-latency streaming architecture integrating blockwise attention and dynamic convolution to enable efficient in-domain representation learning. Compared to general-purpose speech encoders, our approach achieves significantly lower word error rates (WER) on standard ATC benchmarks and demonstrates superior robustness under challenging conditions—including varying signal-to-noise ratios and terminology-dense utterances—while maintaining strong domain adaptability. To the best of our knowledge, this is the first work to jointly optimize self-supervised pretraining with a lightweight streaming architecture for safety-critical aviation ASR, establishing a reusable technical framework for professional vertical-domain ASR applications.
📝 Abstract
In this study, we investigate the benefits of domain-specific self-supervised pre-training for both offline and streaming ASR in Air Traffic Control (ATC) environments. We train BEST-RQ models on 4.5k hours of unlabeled ATC data, then fine-tune on a smaller supervised ATC set. To enable real-time processing, we propose using chunked attention and dynamic convolutions, ensuring low-latency inference. We compare these in-domain SSL models against state-of-the-art, general-purpose speech encoders such as w2v-BERT 2.0 and HuBERT. Results show that domain-adapted pre-training substantially improves performance on standard ATC benchmarks, significantly reducing word error rates when compared to models trained on broad speech corpora. Furthermore, the proposed streaming approach further improves word error rate under tighter latency constraints, making it particularly suitable for safety-critical aviation applications. These findings highlight that specializing SSL representations for ATC data is a practical path toward more accurate and efficient ASR systems in real-world operational settings.