BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

📅 2025-10-06
🤖 AI Summary
Deploying speech recognition models for low-resource languages such as Bambara on edge devices is challenging: labeled data is extremely scarce (only 32 hours), and vocabulary pruning fails because speakers frequently code-switch. We propose a lightweight Transformer compression framework tailored for edge deployment. Instead of conventional layer removal, our approach merges layers, combined with attention-head pruning, low-rank decomposition of the embedding layers, and feature-level knowledge distillation, which significantly reduces retraining data requirements without modifying the vocabulary. Experiments show a 48% reduction in model size, a 2.15× inference speedup on a MacBook Air M1, and only a ~10% relative WER increase, preserving 90% of the original accuracy. This framework establishes an efficient and robust compression paradigm for deploying low-resource ASR models on resource-constrained edge devices.

📝 Abstract
Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
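The embedding compression described above can be illustrated with a minimal sketch: factor the vocabulary embedding matrix into two smaller matrices via a truncated SVD, cutting parameter count without touching the vocabulary itself. The function name and rank below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def factorize_embedding(E, rank):
    """Approximate an embedding matrix E (vocab x dim) as the product
    of two low-rank factors A (vocab x rank) and B (rank x dim)."""
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # vocab x rank, singular values folded in
    B = Vt[:rank, :]             # rank x dim
    return A, B

# Toy example: a 1000-token vocabulary with 64-dim embeddings.
rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 64))
A, B = factorize_embedding(E, rank=16)

# The two factors replace the full matrix: 64,000 params -> 17,024.
print(E.size, "->", A.size + B.size)
```

In practice the factors would be fine-tuned with feature distillation, as the abstract notes, to recover the approximation loss.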
Problem

Research questions and friction points this paper is trying to address.

Pruning transformers for low-resource languages without massive retraining data
Compressing the Whisper model for edge devices using limited speech data
Maintaining accuracy while reducing model size and increasing inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compress embeddings using low-rank decomposition
Merge layers instead of removing them
Apply feature-level knowledge distillation to recover accuracy after compression