🤖 AI Summary
Context-sensitive formatting of numeric expressions (e.g., years, times, monetary amounts) in ASR transcripts remains challenging: identical digit strings require distinct canonical forms depending on context, such as "1945" as a year versus "19:45" as a time. This work proposes a lightweight, end-to-end adaptation framework that jointly leverages LLMs and TTS to synthesize high-fidelity, context-diverse training data for numeric formatting, and integrates a context-aware parsing strategy to improve post-processing of numeric tokens, all without modifying the underlying ASR model. Compared to cascaded LLM-based rewriting approaches, the method achieves comparable accuracy (within ±0.3% WER) on standard benchmarks while significantly reducing computational overhead and keeping latency and inference cost low. The framework offers a practical recipe for real-time ASR systems: high-accuracy numeric formatting at minimal computational expense.
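To make the ambiguity concrete, here is a minimal, purely illustrative Python sketch of context-dependent numeric formatting. The cue-word lists and the `format_numeric` heuristic are assumptions for demonstration only, not the paper's actual method:

```python
# Toy illustration: the same 4-digit string needs a different canonical
# form depending on surrounding context. Cue words below are assumptions.
TIME_CUES = {"at", "until", "from", "by"}

def format_numeric(tokens, idx):
    """Format the 4-digit token at tokens[idx] as a time (HH:MM) if the
    preceding word is a time cue, otherwise leave it as a plain number
    (e.g., a year). A toy heuristic, not the proposed system."""
    digits = tokens[idx]
    prev = tokens[idx - 1].lower() if idx > 0 else ""
    if prev in TIME_CUES:
        return f"{digits[:2]}:{digits[2:]}"  # "1945" -> "19:45"
    return digits                            # "1945" stays "1945"

print(format_numeric("the train leaves at 1945".split(), 4))  # -> 19:45
print(format_numeric("the war ended in 1945".split(), 4))     # -> 1945
```

A real system must of course handle far richer context than a single preceding word, which is why the paper turns to learned approaches.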
📝 Abstract
This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. The task is challenging because the expected transcript format depends on context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognizing and formatting numeric expressions such as years, timestamps, currency amounts, and quantities. For the end-to-end approach, we employ a data generation strategy that uses a large language model (LLM) together with a text-to-speech (TTS) model to produce adaptation data. Results on our test dataset show that while LLM-based approaches perform well at recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
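The LLM+TTS data generation strategy can be sketched as follows. This is a hedged outline, not the authors' implementation: `prompt_llm`, `spoken_form`, and `synthesize_speech` are hypothetical stubs standing in for whatever LLM, verbalizer, and TTS backends a real pipeline would call:

```python
# Sketch of the adaptation-data pipeline: an LLM proposes written-form
# sentences containing numeric expressions, a verbalizer derives the spoken
# form, and TTS renders audio, yielding (audio, written-form) training pairs.
# All names and sample sentences below are illustrative assumptions.

def prompt_llm(category: str) -> str:
    """Stub for an LLM call returning a sentence with a formatted
    numeric expression of the requested category."""
    samples = {
        "year": "The treaty was signed in 1945.",
        "timestamp": "The meeting starts at 19:45.",
        "currency": "The ticket costs $12.50.",
    }
    return samples[category]

def spoken_form(text: str) -> str:
    """Stub verbalizer: map written numerics to how they are spoken."""
    return (text.replace("19:45", "nineteen forty-five")
                .replace("1945", "nineteen forty-five")
                .replace("$12.50", "twelve dollars fifty"))

def synthesize_speech(text: str) -> bytes:
    """Stub TTS: a real backend would return a waveform here."""
    return text.encode("utf-8")

def adaptation_pair(category: str) -> tuple[bytes, str]:
    written = prompt_llm(category)                    # formatted target
    audio = synthesize_speech(spoken_form(written))   # synthetic audio
    return audio, written
```

The end-to-end ASR model is then fine-tuned on such pairs so that it emits the formatted written form directly, avoiding a separate LLM rewriting pass at inference time.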