🤖 AI Summary
For ASR tasks that require domain-specific knowledge, such as conference speeches, existing models struggle to capture long-range semantic dependencies due to fixed context-window limits and the sparsity of salient information within extended contexts. To address this, we propose SAP², a Speech-Adaptive Pooling and Pruning framework. First, it employs a speech-driven attention pooling mechanism to dynamically compress long-context representations, preserving only the domain-specific keywords most relevant to the current acoustic input. Second, it integrates a two-stage dynamic pruning and fusion strategy to jointly optimize noise suppression and contextual enhancement. Evaluated on SlideSpeech and LibriSpeech, SAP² achieves WERs of 7.71% and 1.12%, respectively, and reduces the biased word error rate (B-WER) by 41.1% on SlideSpeech relative to a context-free baseline. Moreover, it scales well to long contexts, establishing an efficient and robust paradigm for contextualized ASR.
📝 Abstract
Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily from constrained model context windows and the sparsity of relevant information within extensive contextual noise. To address this, we propose SAP², a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP² on the SlideSpeech and LibriSpeech datasets, with word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces the biased word error rate (B-WER) by 41.1% compared to a non-contextual baseline. SAP² also exhibits robust scalability, consistently maintaining performance under extensive contextual input on both datasets.
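To make the speech-driven pooling-and-pruning idea concrete, here is a minimal NumPy sketch of one plausible reading: context keyword embeddings are scored by cross-attention against the acoustic frames, low-relevance keywords are pruned, and the survivors are pooled by their attention mass. This is not the authors' implementation; the dimensions, the mean-over-frames salience score, and the fixed top-k pruning rule are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def speech_driven_pool_and_prune(speech, keywords, top_k):
    """Compress a long keyword context conditioned on the speech.

    speech:   (T, d) acoustic frame embeddings (query side)
    keywords: (N, d) context keyword embeddings (key/value side)
    Returns (keep, pooled): indices of the surviving keywords and a
    (top_k, d) relevance-weighted compression of their embeddings.
    """
    d = speech.shape[-1]
    # Cross-attention scores: how strongly each frame attends to each keyword.
    scores = speech @ keywords.T / np.sqrt(d)      # (T, N)
    attn = softmax(scores, axis=-1)                # rows sum to 1
    # Per-keyword salience: average attention mass over all frames
    # (an illustrative choice, not the paper's exact scoring rule).
    relevance = attn.mean(axis=0)                  # (N,)
    # Prune: keep only the top_k most speech-relevant keywords.
    keep = np.argsort(relevance)[::-1][:top_k]
    # Pool: re-weight the survivors by their renormalized relevance.
    w = relevance[keep] / relevance[keep].sum()
    pooled = w[:, None] * keywords[keep]           # (top_k, d)
    return keep, pooled

rng = np.random.default_rng(0)
speech = rng.normal(size=(50, 16))     # 50 frames, hypothetical dim 16
keywords = rng.normal(size=(200, 16))  # 200 candidate context keywords
keep, pooled = speech_driven_pool_and_prune(speech, keywords, top_k=8)
print(keep.shape, pooled.shape)
```

In the paper's two-stage variant, a step like this would run twice: a first coarse pass to suppress contextual noise, then a second pass to fuse the retained keywords with the recognizer's state.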