A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses vocabulary size selection in end-to-end automatic speech recognition (ASR) systems, a choice that has long been made empirically and without theoretical grounding. It fits learning curves to the training data and applies first- and second-derivative tests from calculus to formally construct an optimization objective for subword tokenizers such as Byte Pair Encoding (BPE). The framework estimates the vocabulary size automatically rather than by trial and error. Evaluated on the LibriSpeech dataset, the vocabulary size chosen by the method improves ASR performance, and the framework offers an interpretable and reproducible basis for vocabulary design.
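As a reminder of the calculus involved, the derivative tests can be written as follows. The symbols are assumptions made for illustration: $C(v)$ stands for a smooth curve fitted to observed (vocabulary size, cost) points and $v^{*}$ for the estimated vocabulary size. The summary does not give the paper's actual cost function, so this is only the generic form of the test.

```latex
% First derivative test: candidate points are the stationary points of the fitted curve.
\frac{dC}{dv}\bigg|_{v = v^{*}} = 0
% Second derivative test: a stationary point is a local minimum if the curve is convex there.
\qquad \text{and} \qquad
\frac{d^{2}C}{dv^{2}}\bigg|_{v = v^{*}} > 0
```

A stationary point with a negative second derivative is a local maximum and is discarded; a zero second derivative leaves the test inconclusive.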

📝 Abstract

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous: it is typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) take the vocabulary size as an input hyper-parameter and generate the sub-words used during ASR training. Popular toolkits like ESPnet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by fitting a curve to the training data and applying the first and second derivative tests from calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard LibriSpeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves ASR performance. The main contribution of this paper is a formal approach to identifying the vocabulary size best suited for training an end-to-end ASR system.
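The curve-fitting and derivative-test idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' method: the cubic curve family, the log2 scaling of the vocabulary size, and the function names (`fit_polynomial`, `local_minima_of_cubic`, `estimate_vocab_size`) are all hypothetical choices, and the cost values passed in stand in for whatever cost the paper measures at each vocabulary size.

```python
import math

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*x + ..."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum x^(i+j) and b[i] = sum y*x^i.
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs

def derivative(coeffs):
    """Coefficients of the derivative of a polynomial given in ascending order."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def evaluate(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def local_minima_of_cubic(coeffs):
    """Stationary points of a cubic that pass the second derivative test."""
    d1 = derivative(coeffs)      # quadratic: first derivative
    d2 = derivative(d1)          # linear: second derivative
    c, b, a = d1
    disc = b * b - 4 * a * c
    if abs(a) < 1e-12 or disc < 0:
        return []                # no real stationary points to test
    roots = [(-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1, -1)]
    # First derivative is zero at each root; keep those with positive curvature.
    return [r for r in roots if evaluate(d2, r) > 0]

def estimate_vocab_size(vocab_sizes, costs):
    """Fit a cubic in centered log2(vocab size) and return the minimizing size.

    Assumption: a cubic is an adequate model of the cost curve. Returns None
    if no local minimum lies inside the observed range.
    """
    xs = [math.log2(v) for v in vocab_sizes]
    mean = sum(xs) / len(xs)
    ts = [x - mean for x in xs]  # centering keeps the fit well conditioned
    coeffs = fit_polynomial(ts, costs, 3)
    minima = [t for t in local_minima_of_cubic(coeffs) if min(ts) <= t <= max(ts)]
    if not minima:
        return None
    best = min(minima, key=lambda t: evaluate(coeffs, t))
    return round(2 ** (best + mean))
```

In use, one would measure a cost at a handful of candidate vocabulary sizes, pass both lists to `estimate_vocab_size`, and train the tokenizer with the returned size. Restricting the answer to the observed range avoids extrapolating the fitted curve beyond the data.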

Problem

Research questions and friction points this paper is trying to address.

vocabulary size

end-to-end ASR

tokenization

hyper-parameter

automatic speech recognition

Innovation

Methods, ideas, or system contributions that make the work stand out.

vocabulary size estimation

end-to-end ASR

calculus-based optimization