RNN-Transducer-based Losses for Speech Recognition on Noisy Targets

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In industrial-scale speech recognition, large training corpora often contain transcription noise—deleted or inserted words—that degrades model performance. To address this, we propose three robust RNN-Transducer (RNN-T) loss functions: Star-Transducer (robust to deletions), Bypass-Transducer (robust to insertions), and Target-Robust Transducer (robust to both). The approach introduces skip-frame and skip-token transitions into the RNN-T loss lattice, so that alignment paths accounting for transcript errors are optimized jointly with the canonical ones. Under noisy supervision, the Star-, Bypass-, and Target-Robust Transducer losses recover over 90%, 60%, and 70%, respectively, of the quality of a model trained on clean transcripts, making end-to-end speech recognition practical under imperfect supervision.

📝 Abstract
Training speech recognition systems on noisy transcripts is a significant challenge in industrial pipelines, where datasets are enormous and ensuring accurate transcription for every instance is difficult. In this work, we introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models. Our Star-Transducer loss addresses deletion errors by incorporating "skip frame" transitions in the loss lattice, restoring over 90% of the system's performance compared to models trained with accurate transcripts. The Bypass-Transducer loss uses "skip token" transitions to tackle insertion errors, recovering more than 60% of the quality. Finally, the Target-Robust Transducer loss merges these approaches, offering robust performance against arbitrary errors. Experimental results demonstrate that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data by restoring over 70% of the quality compared to well-transcribed data.
Problem

Research questions and friction points this paper is trying to address.

Mitigating transcription errors in RNN-Transducer models
Addressing deletion errors with skip frame transitions
Combating insertion errors via skip token transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Star-Transducer loss skips frames for deletions
Bypass-Transducer loss skips tokens for insertions
Target-Robust Transducer combines both approaches
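The idea behind these transitions can be illustrated with a toy forward pass over a dense RNN-T lattice. This is a minimal sketch, not the paper's implementation: the function name `robust_transducer_forward` and the fixed skip log-weights `skip_frame_lp` / `skip_token_lp` are illustrative assumptions (the paper's losses presumably set these arcs within the full RNN-T training objective). Setting both skip weights to negative infinity recovers the standard RNN-T forward recursion; raising them adds probability mass for paths that absorb untranscribed audio (deletions) or bypass spurious transcript tokens (insertions).

```python
import math

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over scalar arguments."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def robust_transducer_forward(log_blank, log_emit,
                              skip_frame_lp=-math.inf,
                              skip_token_lp=-math.inf):
    """Toy forward pass over an RNN-T lattice with two extra arc types.

    log_blank[t][u] -- log-prob of blank at lattice node (t, u): advance one frame.
    log_emit[t][u]  -- log-prob of the next target token at (t, u): advance one token.
    skip_frame_lp   -- assumed fixed log-weight of a "skip frame" arc
                       (Star-Transducer-style deletion robustness).
    skip_token_lp   -- assumed fixed log-weight of a "skip token" arc
                       (Bypass-Transducer-style insertion robustness).
    Returns the total log-probability of all alignment paths.
    """
    T = len(log_blank)        # number of acoustic frames
    U = len(log_emit[0])      # number of target tokens
    NEG = -math.inf
    alpha = [[NEG] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            terms = [NEG]
            if t > 0:
                # advance one frame: blank emission OR a skip-frame arc
                terms.append(alpha[t - 1][u]
                             + logsumexp(log_blank[t - 1][u], skip_frame_lp))
            if u > 0:
                # advance one token: real emission OR a skip-token arc
                terms.append(alpha[t][u - 1]
                             + logsumexp(log_emit[t][u - 1], skip_token_lp))
            alpha[t][u] = logsumexp(*terms)
    # as in the standard RNN-T lattice, every path ends with a final blank
    return alpha[T - 1][U] + log_blank[T - 1][U]
```

With the skip weights at `-inf` the recursion sums exactly the canonical RNN-T alignment paths; with finite skip weights the same dynamic program also scores the anomalous paths, which is what lets training tolerate mismatched transcripts.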