🤖 AI Summary
This work addresses two limitations: current automatic speech recognition systems recognize rare domain-specific terms poorly, and conventional open-vocabulary keyword spotting methods scale poorly to large lexicons. The authors propose an efficient open-vocabulary keyword detection framework that integrates contextual biasing with compressed feature storage, improving recognition of rare terms without requiring fine-tuning of the underlying speech recognition model. By introducing a new feature representation and fusion mechanism, the method reduces memory consumption by up to 128× while enabling real-time detection over large-scale terminology banks for the first time. Moreover, its entity recall on unseen languages remains comparable to that of uncompressed approaches, which improves the system’s scalability and practical applicability.
📝 Abstract
Automatic speech recognition systems have been shown to underperform when transcribing words rarely seen in the training data, in particular specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can handle glossaries of only a few hundred terms before they become an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than that of a comparable baseline, allowing users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.
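To give a sense of scale for the "up to 128 times smaller" figure, the following is a back-of-the-envelope sketch in Python. The abstract does not state the glossary size, the feature shape, or how the compression is achieved, so every number below (one million terms, 16 feature vectors per term, 512 dimensions, 32-bit floats) is a hypothetical assumption chosen only for illustration, and the 128× ratio is applied as a best-case bound rather than a measured result.

```python
def feature_store_bytes(num_terms, vectors_per_term, feature_dim, bytes_per_value=4):
    """Bytes needed to store dense features for every glossary term.

    All arguments are illustrative assumptions, not values from the paper.
    """
    return num_terms * vectors_per_term * feature_dim * bytes_per_value


def compressed_store_bytes(uncompressed_bytes, ratio=128):
    """Best-case footprint if the full 'up to 128x' reduction is reached."""
    return uncompressed_bytes // ratio


# Hypothetical glossary: 1,000,000 terms, 16 float32 vectors of size 512 each.
baseline = feature_store_bytes(1_000_000, 16, 512)
compressed = compressed_store_bytes(baseline)

print(f"uncompressed: {baseline / 2**30:.2f} GiB")
print(f"compressed:   {compressed / 2**20:.2f} MiB")
```

Under these assumed numbers, the uncompressed store needs roughly 30 GiB, while the best-case compressed store needs about 244 MiB, which is the kind of difference that separates a feature bank that must be paged from disk from one that fits in memory on a single machine.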