🤖 AI Summary
To address the deployment challenges of large pre-trained speech models for Arabic-language tasks in resource-constrained environments, this work introduces the first lightweight self-supervised speech foundation model family tailored specifically for Arabic. Methodologically, we integrate iterative self-distillation with low-rank approximation to efficiently compress knowledge from a bilingual teacher model into a shallow student architecture, preserving Arabic-specific phonological features—such as pharyngeal consonants and stress patterns—while substantially reducing parameter count and computational cost. Experiments demonstrate that, with minimal fine-tuning, our model achieves state-of-the-art or near-state-of-the-art performance on three Arabic downstream tasks: automatic speech recognition (ASR), speech emotion recognition (SER), and dialect identification (DID). It delivers a 3.2× inference speedup and reduces memory footprint by 76%, offering a deployable, high-fidelity, and cost-efficient solution for Arabic speech understanding under low-resource conditions.
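The low-rank approximation mentioned above can be illustrated with a minimal sketch: a dense projection matrix is factored into two thin matrices via truncated SVD, cutting its parameter count while approximating the original transform. All dimensions and names here are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical teacher projection weight (dimensions are illustrative).
d_out, d_in = 768, 768
W = rng.standard_normal((d_out, d_in))

# Truncated SVD: keep only the top-r singular components.
r = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # (d_out, r) factor, singular values folded in
B = Vt[:r, :]          # (r, d_in) factor

# One dense matrix is replaced by two thin ones: 6x fewer parameters here.
full_params = W.size                  # 768 * 768 = 589824
low_rank_params = A.size + B.size     # 2 * 768 * 64 = 98304
print(full_params, low_rank_params)   # → 589824 98304

# Applying the compressed layer: x @ B.T @ A.T approximates x @ W.T.
x = rng.standard_normal((1, d_in))
approx = x @ B.T @ A.T
```

Truncated SVD gives the best rank-r approximation in the Frobenius norm, which is why it is a common starting point for compressing over-parameterized layers.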
📝 Abstract
Large pre-trained speech models excel at downstream tasks, but their deployment is impractical in resource-limited environments. In this paper, we introduce HArnESS, the first Arabic-centric self-supervised speech model family, designed to capture the nuances of Arabic speech. Using iterative self-distillation, we train large bilingual HArnESS (HL) SSL models and then distill their knowledge into compressed student models (HS, HST), preserving Arabic-specific representations. We further apply low-rank approximation to compress the teacher's discrete supervision into shallow, thin models. We evaluate HArnESS on Arabic ASR, Speech Emotion Recognition (SER), and Dialect Identification (DID), demonstrating its effectiveness against HuBERT and XLS-R. With minimal fine-tuning, HArnESS achieves SOTA or comparable performance, making it a lightweight yet powerful alternative for real-world use. We release our distilled models and findings to support responsible research and deployment in low-resource settings.
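The "discrete supervision" used for distillation can be sketched as follows: the teacher assigns each speech frame a discrete cluster id, and the student is trained with cross-entropy to predict those ids, in the spirit of HuBERT-style masked prediction. This is a minimal illustrative sketch under assumed shapes, not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T frames, a cluster vocabulary of C discrete units.
T, C = 50, 100
teacher_ids = rng.integers(0, C, size=T)          # teacher's discrete targets
student_logits = rng.standard_normal((T, C))      # untrained student outputs

def distill_loss(logits, targets):
    """Mean softmax cross-entropy against the teacher's discrete labels."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numeric stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

loss = distill_loss(student_logits, teacher_ids)
print(round(float(loss), 3))
```

Minimizing this loss pushes the student's frame-level predictions toward the teacher's discretized representations, which is what lets a shallow, thin student retain the teacher's Arabic-specific structure.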