🤖 AI Summary
This work addresses the limitations of conventional pipeline-based automatic speech recognition (ASR) and named entity recognition (NER), including entity omission and low information density in transcriptions, by proposing the first end-to-end joint speech understanding framework that directly maps speech to structured text annotated with open-type entity labels. Methodologically, it extends the Whisper architecture with prompt-driven training, synthetic speech–text–entity triplet data augmentation, and an open-NER-label-guided autoregressive decoding mechanism. Its core contribution lies in unifying ASR and NER into a single model that discards the closed-domain entity assumption, enabling dynamic recognition of newly introduced entity types. Experiments demonstrate substantial improvements over natural baselines across cross-domain open-NER and supervised fine-tuning tasks: entity recall increases by 12.3%, and transcription information density rises by 27.6%, validating the dual benefits of joint modeling—enhanced semantic depth and improved generalization in speech understanding.
📝 Abstract
Integrating named entity recognition (NER) with automatic speech recognition (ASR) can significantly enhance transcription accuracy and informativeness. In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. Building on recent advancements in open NER research, we augment a large synthetic dataset with synthetic speech samples. This allows us to train WhisperNER on a large number of examples with diverse NER tags. During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities. To evaluate WhisperNER, we generate synthetic speech for commonly used NER benchmarks and annotate existing ASR datasets with open NER tags. Our experiments demonstrate that WhisperNER outperforms natural baselines on both out-of-domain open type NER and supervised finetuning.