🤖 AI Summary
To address the challenge of balancing accuracy and training efficiency for lightweight automatic speech recognition (ASR) models on resource-constrained devices, this paper proposes a two-stage representation learning framework. In the first stage, a high-performance teacher model is trained end-to-end. In the second stage, multiple compact student models of varying sizes are efficiently derived from this single teacher via multi-scale knowledge distillation and representation alignment. Unlike conventional pruning or single-stage distillation—both prone to substantial accuracy degradation and redundant retraining—this approach enables joint compression of heterogeneous student models from one teacher training run. Evaluated on standard ASR benchmarks, the method achieves up to a 12.54% relative reduction in word error rate over baselines while accelerating training by 3×. To the authors' knowledge, this is the first work to support one-time large-model training followed by concurrent, size-flexible lightweight model compression, achieving a strong trade-off among accuracy, training efficiency, and deployment adaptability.
📝 Abstract
Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of the smaller models to reach acceptable performance. To address these issues, we introduce an effective two-step representation-learning-based approach that produces several small models from a single large model while ensuring considerably better performance within a limited number of epochs. Comprehensive experiments on ASR benchmarks demonstrate the efficacy of our approach, achieving a three-fold training speed-up and up to a 12.54% relative word error rate improvement.
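The abstract describes deriving several compact student models from one trained teacher via knowledge distillation with representation alignment, but this excerpt does not specify the exact objective. A minimal sketch of what such a student objective could look like, assuming a standard temperature-scaled KL term on the logits plus an MSE feature-alignment term (the function names, temperature, and weighting here are hypothetical, not from the paper):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats,
                      temperature=2.0, alpha=0.5):
    """Hypothetical student objective: temperature-scaled KL between
    teacher and student output distributions, plus an MSE alignment
    term between intermediate representations."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(p_t || p_s), scaled by T^2 as in standard logit distillation
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)),
                axis=-1).mean() * temperature ** 2
    # Representation alignment (assumes feature dims already match)
    align = np.mean((student_feats - teacher_feats) ** 2)
    return alpha * kl + (1 - alpha) * align
```

Under this reading, each student of a different size would minimize its own copy of this loss against the same frozen teacher, which is what allows several students to be produced from a single teacher training run.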