Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation

2025-05-29
AI Summary
Existing deep biasing methods enhance the subword units of a context phrase independently, which compromises the phrase's semantic integrity and degrades ASR performance. To address this, we propose a phrase-level contextualized speech recognition framework. The method enhances the encoder architecture and combines phrase-level dynamic vocabulary prediction with a confidence-driven activation decoding mechanism, so that each context phrase is modeled as a single semantic unit. We also design a frame-to-phrase bias loss that explicitly enforces complete phrase-level output and suppresses erroneous biasing. Evaluated on LibriSpeech and WenetSpeech, the approach achieves relative WER reductions of 28.31% and 23.49%, respectively, while WER on context phrases drops by a relative 72.04% and 75.69%. These results indicate improved robustness and accuracy in recognizing critical phrases.

Abstract
Deep biasing improves automatic speech recognition (ASR) performance by incorporating contextual phrases. However, most existing methods enhance the subwords of a contextual phrase as independent units, which can compromise the integrity of the phrase and reduce accuracy. In this paper, we propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. We introduce architectural optimizations and integrate a bias loss to extend phrase-level predictions based on frame-level outputs. We also introduce a confidence-activated decoding method that ensures the complete output of contextual phrases while suppressing incorrect bias. Experiments on the LibriSpeech and WenetSpeech datasets demonstrate that our approach achieves relative WER reductions of 28.31% and 23.49% compared to the baseline, with the WER on contextual phrases decreasing relatively by 72.04% and 75.69%.
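
The reductions quoted above are relative: each is a ratio against the baseline error rate, not an absolute percentage-point drop. A minimal sketch of that arithmetic follows. The function name and the WER values are hypothetical and chosen only for illustration; they are not the baseline or system numbers from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction, in percent, of a system over a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT taken from the paper.
example = relative_wer_reduction(baseline_wer=10.0, system_wer=7.5)
print(f"{example:.2f}% relative reduction")  # prints "25.00% relative reduction"
```

With these illustrative values, a drop from 10.0% to 7.5% WER is a 2.5-point absolute reduction but a 25% relative reduction.
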
Problem

Research questions and friction points this paper is trying to address.

Most deep biasing methods enhance the subwords of a contextual phrase as independent units, which can break phrase integrity
Broken phrase integrity reduces recognition accuracy
Contextual phrases need to be output completely while incorrect bias is suppressed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-based phrase-level contextualized ASR method
Dynamic vocabulary prediction and activation
Confidence-activated decoding for phrase integrity
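
To make the idea of dynamic vocabulary activation concrete, here is a toy sketch of one way a confidence gate could be applied during greedy CTC-style decoding. It is an illustration under stated assumptions, not the paper's implementation: the function `greedy_decode_with_phrases`, the fixed `threshold`, the fallback to the best static token, and the hand-written frame posteriors are all hypothetical. The assumed setup is that each context phrase is appended to the static vocabulary as a single token, and a phrase token is emitted only when its frame-level probability reaches the threshold.

```python
from typing import List, Sequence

BLANK = "<blank>"


def greedy_decode_with_phrases(
    frame_probs: Sequence[Sequence[float]],
    static_vocab: List[str],
    phrases: List[str],
    threshold: float = 0.6,
) -> List[str]:
    """Greedy CTC-style decoding over a vocabulary extended with phrase tokens.

    Assumption: a phrase token is "activated" only if it is the per-frame
    argmax AND its probability is at least `threshold`; otherwise the frame
    falls back to the best token in the static vocabulary.
    """
    vocab = static_vocab + phrases  # phrase tokens occupy the trailing indices
    n_static = len(static_vocab)
    output: List[str] = []
    prev = None
    for probs in frame_probs:
        if len(probs) != len(vocab):
            raise ValueError("each frame needs one probability per vocabulary entry")
        best = max(range(len(vocab)), key=lambda i: probs[i])
        if best >= n_static and probs[best] < threshold:
            # Low-confidence phrase prediction: suppress the bias.
            best = max(range(n_static), key=lambda i: probs[i])
        token = vocab[best]
        # Standard CTC collapse: drop blanks and merge consecutive repeats.
        if token != BLANK and token != prev:
            output.append(token)
        prev = token
    return output


# Hand-written posteriors over [<blank>, "call", "now", "new york city"].
frames = [
    [0.10, 0.80, 0.05, 0.05],  # "call"
    [0.10, 0.10, 0.10, 0.70],  # phrase token, confident
    [0.10, 0.05, 0.05, 0.80],  # same phrase token again (merged)
    [0.35, 0.10, 0.15, 0.40],  # phrase token below threshold, falls back to blank
    [0.10, 0.10, 0.70, 0.10],  # "now"
]
decoded = greedy_decode_with_phrases(frames, [BLANK, "call", "now"], ["new york city"])
print(decoded)  # ['call', 'new york city', 'now']
```

Because the phrase is a single vocabulary entry in this sketch, it is emitted either whole or not at all, which is the sense in which phrase-level prediction preserves phrase integrity. Raising the threshold trades recall on context phrases for fewer false activations.
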
Authors

Zhennan Lin
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
Kaixun Huang
Northwestern Polytechnical University
Wei Ren
Chongqing Changan Automobile Co., Ltd., China
Linju Yang
Chongqing Changan Automobile Co., Ltd., China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China