Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition

📅 2025-11-14
🤖 AI Summary
For ASR tasks that require domain-specific knowledge, such as conference presentations, existing models struggle to exploit long-context information because model context windows are constrained and the relevant information is sparse within extensive contextual noise. To address this, we propose SAP², a framework that dynamically prunes and integrates relevant contextual keywords in two stages. Each stage uses a Speech-Driven Attention-based Pooling mechanism, which compresses context embeddings while preserving speech-salient information. SAP² achieves WERs of 7.71% on SlideSpeech and 1.12% on LibriSpeech. On SlideSpeech, it reduces the biased keyword error rate (B-WER) by 41.1% compared to non-contextual baselines. It also scales robustly, maintaining performance under extensive contextual input on both datasets.
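For readers unfamiliar with the metrics quoted above, here is a minimal sketch of how a word error rate and a relative reduction are usually computed. It assumes whitespace tokenization and no text normalization, which may differ from the paper's scoring setup. It computes plain WER only; B-WER restricts scoring to the biased keywords and is not implemented here. The function names `word_error_rate` and `relative_reduction` are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (0.411 means 41.1%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline
```

A relative reduction compares two error rates, so the 41.1% figure describes how much lower the B-WER is than the baseline's B-WER, not an absolute B-WER value.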

📝 Abstract
Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP$^{2}$ also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
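The abstract does not spell out how Speech-Driven Attention-based Pooling is built, so the following is only a rough sketch of the general idea of attention pooling conditioned on speech. It assumes a single query vector obtained by averaging the speech frames, scaled dot-product scores, and no learned projections. All names (`speech_driven_pool`, `speech_frames`, `token_embeddings`) are hypothetical, and the paper's actual architecture may differ substantially.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def speech_driven_pool(speech_frames, token_embeddings):
    """Compress several context token embeddings into one vector.

    Assumption: the speech query is the mean of the speech frames.
    Tokens that align with the query receive larger attention weights,
    so the pooled vector is dominated by speech-relevant tokens.
    """
    query = mean_vector(speech_frames)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in token_embeddings])
    pooled = [
        sum(w * tok[k] for w, tok in zip(weights, token_embeddings))
        for k in range(len(query))
    ]
    return pooled, weights


# Toy input: the speech points along the first axis, so the first
# context token should receive the larger attention weight.
pooled, weights = speech_driven_pool(
    speech_frames=[[1.0, 0.0], [1.0, 0.0]],
    token_embeddings=[[1.0, 0.0], [0.0, 1.0]],
)
```

The point of the sketch is the compression step: many context embeddings go in, one vector comes out, and the weighting depends on the current speech rather than on the context alone.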
Problem

Research questions and friction points this paper is trying to address.

Improving ASR performance in long-context domain-specific scenarios
Addressing sparse relevant information within extensive contextual noise
Overcoming constrained model context windows in speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically prunes and integrates contextual keywords
Uses Speech-Driven Attention-based Pooling mechanism
Compresses context embeddings while preserving speech information
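The points above can be illustrated with a generic two-stage top-k filter. This is not the paper's algorithm, whose stage definitions are not given on this page. The sketch assumes each keyword already has one embedding, that each stage has its own speech representation, and that relevance is cosine similarity. The names `two_stage_prune`, `speech_stage1`, `speech_stage2`, `k1`, and `k2` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k(keyword_vecs, speech_vec, k):
    """Return the k keywords most similar to the speech vector, best first."""
    ranked = sorted(
        keyword_vecs,
        key=lambda name: cosine(keyword_vecs[name], speech_vec),
        reverse=True,
    )
    return ranked[:k]


def two_stage_prune(keyword_vecs, speech_stage1, speech_stage2, k1, k2):
    """Coarse pruning to k1 keywords, then finer pruning to k2 keywords."""
    survivors = top_k(keyword_vecs, speech_stage1, k1)
    # The second stage only re-scores keywords that survived the first.
    remaining = {name: keyword_vecs[name] for name in survivors}
    selected = top_k(remaining, speech_stage2, k2)
    return survivors, selected


# Toy keyword embeddings (made-up values, for illustration only).
keywords = {
    "transformer": [1.0, 0.0],
    "gradient": [0.9, 0.1],
    "banana": [0.0, 1.0],
    "pizza": [-1.0, 0.0],
}
survivors, selected = two_stage_prune(
    keywords, speech_stage1=[1.0, 0.2], speech_stage2=[1.0, 0.0], k1=3, k2=1
)
```

Pruning in stages keeps the expensive, finer scoring limited to a short candidate list, which is one plausible reason to split the work when the context is long and mostly irrelevant.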
Authors

Yiming Rong
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China
Yixin Zhang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China
Ziyi Wang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China
Deyang Jiang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China
Yunlong Zhao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China
Haoran Wu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China
Shiyu Zhou
Professor of Industrial Engineering
Industrial engineering, manufacturing, quality control, applied statistics
Bo Xu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China