🤖 AI Summary
To address the deployment challenges of large pre-trained speech models for Arabic-language tasks in resource-constrained environments, this work introduces the first lightweight self-supervised speech foundation model family tailored specifically for Arabic. Methodologically, we integrate iterative self-distillation with low-rank approximation to efficiently compress knowledge from a bilingual teacher model into a shallow student architecture, preserving Arabic-specific phonological features—such as pharyngeal consonants and stress patterns—while substantially reducing parameter count and computational cost. Experiments demonstrate that, with minimal fine-tuning, our model achieves state-of-the-art or near-state-of-the-art performance on three Arabic downstream tasks: automatic speech recognition (ASR), speech emotion recognition (SER), and dialect identification (DID). It delivers a 3.2× inference speedup and reduces memory footprint by 76%, offering a deployable, high-fidelity, and cost-efficient solution for Arabic speech understanding under low-resource conditions.
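The low-rank approximation mentioned above can be illustrated with a minimal sketch: a dense projection matrix is factored into two thin matrices via truncated SVD, cutting its parameter count while approximating the original transform. All dimensions and names here are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teacher projection weight (dimensions are illustrative).
d_out, d_in = 768, 768
W = rng.standard_normal((d_out, d_in))

# Truncated SVD: keep only the top-r singular components.
r = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # (d_out, r) factor, singular values folded in
B = Vt[:r, :]          # (r, d_in) factor

# One dense matrix is replaced by two thin ones: 6x fewer parameters here.
full_params = W.size                  # 768 * 768 = 589824
low_rank_params = A.size + B.size     # 2 * 768 * 64 = 98304
print(full_params, low_rank_params)   # → 589824 98304

# Applying the compressed layer: x @ B.T @ A.T approximates x @ W.T.
x = rng.standard_normal((1, d_in))
approx = x @ B.T @ A.T
```

Truncated SVD gives the best rank-r approximation in the Frobenius norm, which is why it is a common starting point for compressing over-parameterized layers.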
📝 Abstract
Large pre-trained speech models excel at downstream tasks, but their deployment is impractical in resource-limited environments. In this paper, we introduce HArnESS, the first Arabic-centric self-supervised speech model family, designed to capture the nuances of Arabic speech. Using iterative self-distillation, we train large bilingual HArnESS (HL) SSL models and then distill their knowledge into compressed student models (HS, HST), preserving Arabic-specific representations. We further apply low-rank approximation to compress the teacher's discrete supervision into shallow, thin models. We evaluate HArnESS on Arabic ASR, Speech Emotion Recognition (SER), and Dialect Identification (DID), demonstrating its effectiveness against HuBERT and XLS-R. With minimal fine-tuning, HArnESS achieves SOTA or comparable performance, making it a lightweight yet powerful alternative for real-world use. We release our distilled models and findings to support responsible research and deployment in low-resource settings.
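The "discrete supervision" used for distillation can be sketched as follows: the teacher assigns each speech frame a discrete cluster id, and the student is trained with cross-entropy to predict those ids, in the spirit of HuBERT-style masked prediction. This is a minimal illustrative sketch under assumed shapes, not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T frames, a cluster vocabulary of C discrete units.
T, C = 50, 100
teacher_ids = rng.integers(0, C, size=T)          # teacher's discrete targets
student_logits = rng.standard_normal((T, C))      # untrained student outputs

def distill_loss(logits, targets):
    """Mean softmax cross-entropy against the teacher's discrete labels."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numeric stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

loss = distill_loss(student_logits, teacher_ids)
print(round(float(loss), 3))
```

Minimizing this loss pushes the student's frame-level predictions toward the teacher's discretized representations, which is what lets a shallow, thin student retain the teacher's Arabic-specific structure.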