🤖 AI Summary
Time-synchronous sequence-to-sequence automatic speech recognition (ASR) models struggle to exploit right-context label information, a weakness that is most pronounced under low-resource conditions. Method: The authors propose a factored loss with auxiliary left and right label contexts that sums over all alignments, making conditioning on the right label context well-defined within a discriminative training criterion where it otherwise causes normalization problems. The approach combines time-synchronous sequence-to-sequence modeling with a hybrid neural network hidden Markov model (NN-HMM) architecture and shows that a factored hybrid HMM system can be trained exclusively with the full-sum (all-path) criterion. Results: Experiments on Switchboard 300h and LibriSpeech 960h show that including the right label context is particularly beneficial when training data are limited, supporting the role of explicit right-context modeling for generalization in low-resource ASR.
📝 Abstract
Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained using a sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorporating the right label context into the training criterion's gradient causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherent generative formulation, enables conditioning on the right label context. However, due to the HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that the inclusion of the right label context is particularly beneficial when training data resources are limited. Moreover, we also show that it is possible to build a factored hybrid HMM system by relying exclusively on the full-sum criterion. Experiments were conducted on Switchboard 300h and LibriSpeech 960h.
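The core mechanics of a full-sum criterion with factored label context can be illustrated with a minimal sketch: a forward algorithm that sums over all monotonic alignments of a label sequence to a frame sequence, where each emission score is conditioned on the (left, center, right) label triple. The function names, the uniform transition model, and the `"#"` boundary symbol below are illustrative assumptions, not the paper's implementation.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def full_sum_log_prob(num_frames, labels, emit):
    """Sum over all monotonic alignments of `labels` to `num_frames` frames.

    At each frame the HMM either stays in the current label state
    (self-loop) or advances to the next one; transition scores are left
    uniform here as a simplification. `emit(t, left, center, right)`
    returns the log emission score of frame t conditioned on the
    factored (left, center, right) label context, with "#" marking the
    sequence boundaries.
    """
    S = len(labels)
    assert 1 <= S <= num_frames
    # Build the factored context triple for every label position.
    ctx = [("#" if s == 0 else labels[s - 1],
            labels[s],
            "#" if s == S - 1 else labels[s + 1]) for s in range(S)]
    # alpha[s] = log-sum over all partial alignments ending in state s.
    alpha = [-math.inf] * S
    alpha[0] = emit(0, *ctx[0])
    for t in range(1, num_frames):
        new_alpha = [-math.inf] * S
        for s in range(S):
            stay = alpha[s]                                  # self-loop
            advance = alpha[s - 1] if s > 0 else -math.inf   # next label
            new_alpha[s] = logsumexp([stay, advance]) + emit(t, *ctx[s])
        alpha = new_alpha
    return alpha[S - 1]  # every alignment must end in the last label
```

With a constant emission log-score c, the result reduces to log(N) + T*c, where N is the number of monotonic alignments of S labels to T frames; this makes the sketch easy to sanity-check before plugging in a neural emission model.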