Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition

📅 2026-02-04
🏛️ IEEE Transactions on Audio, Speech, and Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of conventional automatic speech recognition (ASR) systems, which are constrained by short-term context modeling and the i.i.d. assumption, hindering effective exploitation of long-range acoustic and linguistic information. For the first time, this work empirically demonstrates that incorporating contextual information spanning up to 21.8 minutes significantly improves recognition performance in end-to-end attention-based ASR models, achieving a relative word error rate reduction of up to 14.2%. Through systematic evaluation of key architectural factors—including positional encoding schemes, model depth, and width—the study elucidates their impact on modeling ultra-long sequences (up to one hour), thereby providing both an effective architecture and empirical foundation for long-context speech recognition.

📝 Abstract
Automatic speech recognition (ASR) models are normally trained to operate over single utterances of short duration, typically less than 30 seconds. This choice is made in part due to computational constraints, but it also reflects a common, and often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. When long-format audio recordings are available, they must first be segmented into short utterances and processed independently to work with such systems. In this work, we show that due to recent algorithmic and hardware advances this is no longer necessary, and that current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. To better understand the relationship between training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths, from 10 seconds up to 1 hour. The results show a benefit from using up to 21.8 minutes of context, with up to a 14.2% relative improvement over a short-context baseline in our primary experiments. By modifying various architectural components, we find that the method of encoding positional information and the model’s width/depth are important factors when working with long sequences. Finally, a series of evaluations using synthetic data is constructed to help analyse the model’s use of context. From these results, it is clear that the model uses both linguistic and acoustic aspects of the distant context.
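The relative improvements quoted above follow the standard relative word error rate (WER) reduction formula: the fraction of the baseline's errors that the new model removes. A minimal sketch (the WER values below are illustrative, not results from the paper):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline word error rate removed by the new model."""
    return (baseline_wer - new_wer) / baseline_wer

# Illustrative values only: a 14.2% relative reduction means the
# long-context model removes 14.2% of the short-context baseline's errors.
baseline = 10.0                    # e.g. 10.0% WER from a short-context model
improved = baseline * (1 - 0.142)  # 8.58% WER after the relative reduction
print(f"{relative_wer_reduction(baseline, improved):.1%}")  # → 14.2%
```

Note that a relative reduction is always larger than the corresponding absolute WER difference when the baseline WER is below 100%, which is why papers typically state which of the two they report.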
Problem

Research questions and friction points this paper is trying to address.

automatic speech recognition
long-context modeling
utterance independence assumption
context utilization
sequence length
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-context speech recognition
attention-based ASR
positional encoding
context modeling
end-to-end speech recognition
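Positional encoding, one of the architectural factors the study examines, determines how a model distinguishes token order over very long sequences. As a concrete reference point, the following is a minimal sketch of the standard absolute sinusoidal scheme from the original Transformer; this is just one common choice, not necessarily the scheme the paper settles on, and such absolute encodings are known to extrapolate poorly beyond the training length:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Absolute sinusoidal positional encoding, shape (seq_len, d_model).

    Even-indexed dimensions hold sines, odd-indexed dimensions cosines,
    at geometrically spaced wavelengths from 2*pi up to 10000*2*pi.
    """
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model // 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=100, d_model=64)
print(pe.shape)  # (100, 64)
```

Relative schemes (e.g. those used in Transformer-XL-style models) encode only offsets between positions rather than absolute indices, which is one reason positional encoding matters when evaluation sequences grow to an hour in length.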
Robert Flynn
Department of Computer Science, The University of Sheffield, United Kingdom
Anton Ragni
University of Sheffield
Speech and Language Technologies