Handling Numeric Expressions in Automatic Speech Recognition

📅 2024-07-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Formatting numeric expressions (e.g., years, times, monetary amounts) in ASR transcripts is context-sensitive: the same spoken digits require different written forms depending on context, e.g., “1945” as a year versus “19:45” as a time. This work compares cascaded LLM-based rewriting with end-to-end ASR models adapted on synthetic data, where an LLM generates context-diverse text containing numeric expressions and a TTS model converts that text to speech. Compared with the cascaded LLM-based approaches, the adapted end-to-end models achieve comparable accuracy (within ±0.3% WER) on the authors' test dataset while reducing latency and inference cost. This makes adapted end-to-end models a practical option for real-time ASR systems that need accurate numeric formatting at low computational expense.

📝 Abstract
This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging since the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognizing and formatting numeric expressions, such as years, timestamps, currency amounts, and quantities. For the end-to-end approach, we employed a data generation strategy that uses a large language model (LLM) together with a text-to-speech (TTS) model to generate adaptation data. The results on our test dataset show that while LLM-based approaches perform well at recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
Problem

Research questions and friction points this paper is trying to address.

Correctly formatting numeric expressions in ASR transcripts
Recognizing context-dependent formats like years and timestamps
Comparing cascaded and end-to-end approaches for numeric expression handling
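
To make the context dependence concrete, here is a toy sketch of a rule-based formatter that renders the same four digits as a time or leaves them as a plain number (e.g., a year) based on the preceding word. The cue list `TIME_CUES` and the function `format_four_digits` are hypothetical names introduced for illustration. This is not the paper's method, which uses LLM-based or adapted end-to-end models rather than hand-written rules.

```python
# Hypothetical cue words that suggest a clock time follows.
# A real system would learn context rather than rely on a fixed list.
TIME_CUES = {"at", "until", "departs", "arrives"}


def format_four_digits(digits: str, left_context: str) -> str:
    """Render a four-digit string as HH:MM if the preceding word is a
    time cue and the digits form a valid 24-hour time; otherwise return
    the digits unchanged (e.g., a year)."""
    words = left_context.lower().split()
    cue = words[-1] if words else ""
    hours, minutes = int(digits[:2]), int(digits[2:])
    if cue in TIME_CUES and hours < 24 and minutes < 60:
        return f"{digits[:2]}:{digits[2:]}"
    return digits


print(format_four_digits("1945", "The war ended in"))      # 1945
print(format_four_digits("1945", "The train departs at"))  # 19:45
```

A single-word cue is brittle ("at 1945 Main Street" would be misread as a time), which illustrates why the formatting decision benefits from models that see wider context.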
Innovation

Methods, ideas, or system contributions that make the work stand out.

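Comparison of cascaded and end-to-end approaches to recognizing and formatting numeric expressions
LLM-generated text combined with TTS to synthesize adaptation data
Adapted end-to-end models offer competitive performance at lower latency and inference cost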
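
The LLM-plus-TTS data generation idea can be sketched as a small pipeline. Everything here is an assumption for illustration: the abstract does not describe the prompts, models, or data format, so the text generator and the synthesizer are passed in as plain callables, and `fake_llm` and `fake_tts` are stand-in stubs rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AdaptationExample:
    audio: bytes      # synthesized speech for the sentence
    transcript: str   # target text with formatted numeric expressions


def build_adaptation_set(
    categories: Iterable[str],
    generate_text: Callable[[str], List[str]],
    synthesize: Callable[[str], bytes],
) -> List[AdaptationExample]:
    """Pair each generated sentence with synthesized audio.

    generate_text stands in for an LLM prompted per numeric category;
    synthesize stands in for a TTS model.
    """
    examples = []
    for category in categories:
        for sentence in generate_text(category):
            examples.append(AdaptationExample(synthesize(sentence), sentence))
    return examples


def fake_llm(category: str) -> List[str]:
    # Stub: canned sentences instead of a real LLM call.
    canned = {
        "year": ["The treaty was signed in 1945."],
        "timestamp": ["The train departs at 19:45."],
    }
    return canned.get(category, [])


def fake_tts(text: str) -> bytes:
    # Stub: returns the encoded text, not real audio.
    return text.encode("utf-8")


dataset = build_adaptation_set(["year", "timestamp", "currency"], fake_llm, fake_tts)
print(len(dataset))            # 2
print(dataset[1].transcript)   # The train departs at 19:45.
```

In a real pipeline the TTS input would typically be a spoken-form version of the sentence, while the formatted text is kept as the training target. That normalization step is omitted here.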