🤖 AI Summary
This study addresses the performance degradation in low-resource speech recognition caused by domain mismatch and data scarcity. The authors propose Gumbel-BEARD, a domain adaptation framework that pairs an end-to-end trainable hard Gumbel-Softmax layer selector with the BEST-RQ self-supervised objective to select Whisper encoder layers automatically and dynamically, eliminating manual tuning of the layer choice. This work introduces, for the first time, a learnable layer selection mechanism within the Whisper architecture, substantially enhancing cross-domain generalization. With only 10 hours of labeled MyST data for fine-tuning, the method matches a fully supervised baseline trained on the complete 133-hour labeled set. It achieves new state-of-the-art word error rates (WERs) of 8.21% on MyST (Whisper-medium) and 11.06% on the OGI Spontaneous dataset (Whisper-small), and yields up to a 6% relative WER reduction on the CORAAL adult dialectal dataset.
📝 Abstract
Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
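To make the selector idea concrete, the following is a minimal sketch of a hard Gumbel-Softmax choice over encoder layers, written in plain Python. It is not the authors' implementation: the function names (`sample_gumbel`, `hard_gumbel_softmax`, `select_layer`), the toy layer outputs, and the temperature `tau` are all assumptions made for illustration. It covers only the forward pass. In training, a hard Gumbel-Softmax typically uses a straight-through estimator, so the one-hot choice is used in the forward pass while gradients flow through the soft probabilities, which requires an autograd framework and is not shown here.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform: -log(-log(U))."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hard_gumbel_softmax(logits, tau, rng):
    """Return (one-hot choice, soft probabilities) for one Gumbel-Softmax draw."""
    # Perturb each logit with Gumbel noise, then scale by the temperature.
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    soft = softmax(noisy)
    # "Hard" variant: keep only the arg-max entry as a one-hot vector.
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft


def select_layer(layer_outputs, logits, tau=1.0, seed=0):
    """Pick one layer's output using a hard Gumbel-Softmax over layer logits.

    layer_outputs: one feature vector per encoder layer (hypothetical toy data).
    logits: one selection score per layer (learnable in a real model).
    """
    rng = random.Random(seed)
    hard, soft = hard_gumbel_softmax(logits, tau, rng)
    dim = len(layer_outputs[0])
    # Weighted sum with one-hot weights reduces to the chosen layer's output.
    selected = [
        sum(w * layer[d] for w, layer in zip(hard, layer_outputs))
        for d in range(dim)
    ]
    return selected, hard, soft


if __name__ == "__main__":
    # Three toy "layers", each a 2-dimensional feature vector.
    layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    selected, hard, soft = select_layer(layers, logits=[0.2, 1.5, 0.3], tau=0.5)
    print("one-hot choice:", hard)
    print("selected features:", selected)
```

Lower values of `tau` push the soft probabilities closer to one-hot, while higher values keep them closer to uniform. How the selected layer feeds the BEST-RQ objective is specific to the paper and is not modeled in this sketch.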