🤖 AI Summary
To address privacy concerns and computational constraints in deploying automatic speech recognition (ASR) for children’s speech on edge devices, this work develops a lightweight, on-device ASR system built on the Whisper tiny.en model. The pipeline combines filtering of child speech data, fine-tuning on the MyST corpus, and low-rank compression of the encoder. On MyST, the fine-tuned model achieves a word error rate (WER) of 15.9% (11.8% with data filtering). Low-rank compression reduces the encoder size by 0.51M and speeds up GPU inference by 1.26×, at the cost of an 11% relative WER increase. On a Raspberry Pi, the compressed model requires approximately 2 GFLOPS fewer computations, and both the original and compressed tiny models reach a real-time factor (RTF) of 0.23–0.41 across input audio durations.
📝 Abstract
Reliance on cloud providers for ASR inference in child-centered voice-based applications is becoming difficult to sustain because of regulatory and privacy constraints. Motivated by privacy-preserving design, this study develops a lightweight and efficient Whisper ASR system capable of running on a Raspberry Pi. Fine-tuning the 'tiny.en' model with various data filtering strategies and evaluating on the MyST corpus yielded a Word Error Rate (WER) of 15.9% (11.8% filtered). Low-rank compression reduces the encoder size by 0.51M and gives 1.26x faster inference on a GPU, with an 11% relative WER increase. During inference on the Pi, the compressed version required ~2 GFLOPS fewer computations. The real-time factor (RTF) for both models ranged from 0.23 to 0.41 across input audio durations. Analysis of RAM usage and CPU temperature showed that the Pi was capable of handling both tiny models; however, the small models incurred additional overhead and thermal throttling.
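The abstract does not say how the low-rank compression is carried out. As a rough illustration of the general idea only, the sketch below factorizes a single weight matrix with a truncated SVD and counts the parameters saved. The function names, the rank of 64, and the random 384 × 384 matrix (384 being the hidden width of Whisper tiny) are assumptions chosen for the example, not details from the paper.

```python
import numpy as np


def low_rank_factorize(weight, rank):
    """Approximate `weight` (out_dim x in_dim) as a @ b with inner dimension `rank`.

    Hypothetical helper: uses a truncated SVD, which gives the best rank-`rank`
    approximation in the Frobenius norm. The paper's actual method may differ.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # (out_dim, rank), singular values folded into a
    b = vt[:rank, :]            # (rank, in_dim)
    return a, b


def params_saved(out_dim, in_dim, rank):
    """Parameters removed by replacing one dense matrix with two thin factors."""
    return out_dim * in_dim - rank * (out_dim + in_dim)


# Illustrative stand-in for one encoder projection matrix (random, not real weights).
rng = np.random.default_rng(0)
w = rng.standard_normal((384, 384))

a, b = low_rank_factorize(w, rank=64)
rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)  # relative approximation error
saved = params_saved(384, 384, 64)
```

Replacing the dense layer `x @ w.T` with `(x @ b.T) @ a.T` reduces both parameters and multiply-adds whenever `rank * (out_dim + in_dim) < out_dim * in_dim`. Trained weights usually have a faster-decaying spectrum than the random matrix used here, so the approximation error on a real model would differ from `rel_err` in this sketch.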