Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
In low-resource, short-context streaming ASR post-processing, inverse text normalization (ITN) faces a fundamental accuracy–latency trade-off. To address this, we propose a dynamic context-aware streaming ITN method. Our approach introduces a learnable dynamic chunking mechanism and a right-context fusion strategy, overcoming the limitations of fixed-window methods. It integrates pretrained language models, sliding-window contextual modeling, and streaming incremental decoding to enable real-time, context-sensitive ITN. Evaluated on a Vietnamese dataset, our method achieves accuracy comparable to offline (non-streaming) ITN while substantially outperforming existing streaming ITN approaches. The end-to-end latency remains below 100 ms, satisfying stringent real-time requirements. The method has been successfully deployed in a production-grade streaming ASR system, demonstrating practical efficacy and robustness. Key contributions include: (i) the first learnable dynamic chunking framework for streaming ITN; (ii) effective right-context integration under strict latency constraints; and (iii) state-of-the-art accuracy–latency balance in low-resource, short-context settings.
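The streaming decode described above can be sketched at a token level: hold back a small right-context buffer so each normalization decision sees limited future context, and commit output incrementally as the stream advances. The sketch below is a toy illustration only; the paper's system uses a learnable dynamic chunking mechanism and a pretrained language model, whereas the `WORD2DIGIT` rule and the fixed `right_context` parameter here are purely hypothetical stand-ins.

```python
# Toy sketch of streaming ITN with a right-context lookahead buffer.
# The real method learns chunk boundaries and uses a pretrained LM;
# this spelled-number -> digit rule is a hypothetical placeholder.

WORD2DIGIT = {"zero": "0", "one": "1", "two": "2", "three": "3",
              "four": "4", "five": "5", "six": "6", "seven": "7",
              "eight": "8", "nine": "9"}

def normalize(token: str) -> str:
    """Placeholder normalizer: map spelled-out digits to numerals."""
    return WORD2DIGIT.get(token, token)

def stream_itn(tokens, right_context=2):
    """Emit normalized output incrementally, holding back
    `right_context` tokens so each commit can peek at limited
    future context (the accuracy-latency trade-off knob)."""
    buffer, out = [], []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) > right_context:       # enough lookahead: commit oldest
            out.append(normalize(buffer.pop(0)))
    out.extend(normalize(t) for t in buffer)  # flush remainder at end of stream
    return out
```

With this framing, latency is bounded by the lookahead size: a larger `right_context` gives each decision more future evidence at the cost of delaying output, which is exactly the trade-off the learnable dynamic chunking aims to optimize.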

📝 Abstract
Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text, enhancing both readability and usability. Despite its importance, the integration of streaming ITN within streaming ASR remains largely unexplored due to challenges in accuracy, efficiency, and adaptability, particularly in low-resource and limited-context scenarios. In this paper, we introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. To address streaming constraints, we propose a Dynamic Context-Aware mechanism applied during both training and inference, enabling adaptive chunk-size adjustments and the integration of right-context information. Experimental results demonstrate that our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a Vietnamese dataset, all while maintaining low latency, ensuring seamless integration into ASR systems.
Problem

Research questions and friction points this paper is trying to address.

Improving accuracy and efficiency in streaming Inverse Text Normalization (ITN)
Addressing low-resource and limited-context challenges in ITN
Enabling adaptive chunk size adjustments for streaming ITN
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Context-Aware mechanism for adaptive chunk-size adjustments
Pretrained linguistic representations for robustness
Low-latency streaming integration with ASR
Luong Ho
Zalo AI, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Khanh Le
Zalo AI, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Vinh Pham
Zalo AI, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Bao Nguyen
Hanoi University of Science and Technology
Machine learning, Optimization, Statistics
Tan Tran
Zalo AI, Vietnam; University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
D. Chau
University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam