Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the performance degradation of speech foundation models in low-resource domains caused by domain mismatch and data scarcity. The authors propose Gumbel-BEARD, which combines an end-to-end trainable hard Gumbel-Softmax layer selector with the BEST-RQ self-supervised objective to select Whisper encoder layers automatically, removing the need to tune the layer choice by hand. The authors present this as a learnable layer selection mechanism within the Whisper architecture that improves cross-domain generalization. With only 10 hours of labeled MyST data for fine-tuning, the method matches a fully supervised baseline trained on the complete 133-hour labeled set. It reports new state-of-the-art word error rates (WERs) of 8.21% on MyST with Whisper-medium and 11.06% on the OGI Spontaneous dataset with Whisper-small, and yields up to a 6% relative WER reduction on the CORAAL dialectal dataset.

📝 Abstract

Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
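The hard Gumbel-Softmax selection described in the abstract can be illustrated with a minimal, forward-only sketch. It is not the authors' implementation: the function names, the toy logits, and the temperature value are assumptions made for illustration, and a real training setup would use an autograd framework with a straight-through estimator so that gradients flow through the soft probabilities.

```python
import math
import random


def gumbel_softmax_select(logits, temperature=1.0, rng=random):
    """Forward pass of a hard Gumbel-Softmax choice over candidate layers.

    Returns (one_hot, soft_probs). In training, a straight-through estimator
    would use one_hot in the forward pass and soft_probs for the gradients.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard selection: one-hot vector at the argmax of the soft distribution.
    best = max(range(len(soft)), key=soft.__getitem__)
    one_hot = [1.0 if i == best else 0.0 for i in range(len(soft))]
    return one_hot, soft


def select_layer_output(layer_outputs, weights):
    """Weighted sum of per-layer feature vectors; one-hot weights pick one layer."""
    dim = len(layer_outputs[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
        for d in range(dim)
    ]


rng = random.Random(0)
layer_logits = [0.1, 0.3, 2.0, -0.5]  # hypothetical learnable scores, one per layer
layer_outputs = [[float(i)] * 3 for i in range(4)]  # toy stand-ins for hidden states
one_hot, soft = gumbel_softmax_select(layer_logits, temperature=0.5, rng=rng)
selected = select_layer_output(layer_outputs, one_hot)
```

Lower temperatures push the soft distribution toward the one-hot choice, while the added noise keeps the selection stochastic during training so that layers other than the current favorite are still explored.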

Problem

Research questions and friction points this paper is trying to address.

low-resource domains

domain mismatch

data scarcity

speech foundation models

self-supervised adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gumbel-Softmax selector

self-supervised adaptation

layer selection
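For readers unfamiliar with the self-supervised side, the sketch below shows the general idea behind BEST-RQ-style targets: a frozen random projection and a frozen random codebook turn each feature frame into a discrete label that a model can then be trained to predict for masked frames. The dimensions, the Gaussian initialization, and all function names here are assumptions for illustration; the paper's actual configuration is not given in this summary.

```python
import math
import random


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def make_quantizer(feature_dim, code_dim, codebook_size, seed=0):
    """Build a frozen random projection matrix and a frozen random codebook."""
    rng = random.Random(seed)
    projection = [
        [rng.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in range(code_dim)
    ]
    codebook = [
        l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
        for _ in range(codebook_size)
    ]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Map one feature frame to the index of its nearest codebook entry."""
    projected = l2_normalize(
        [sum(w * x for w, x in zip(row, frame)) for row in projection]
    )

    def squared_distance(code):
        return sum((p - c) ** 2 for p, c in zip(projected, code))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


# Toy usage: 5 frames of 8-dimensional features, quantized into 16 classes.
data_rng = random.Random(1)
frames = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
projection, codebook = make_quantizer(feature_dim=8, code_dim=4, codebook_size=16)
targets = [quantize(frame, projection, codebook) for frame in frames]
```

Because neither the projection nor the codebook is trained, the targets are fixed for a given seed, and only the encoder (here, the selected Whisper layers) has to adapt to the target-domain audio.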