🤖 AI Summary
For ASR tasks that require domain-specific knowledge, such as conference speeches, existing models struggle to capture long-range semantic dependencies due to fixed context-window limits and the sparsity of salient information within extended contexts. To address this, we propose SAP², a Speech-Adaptive Pooling and Pruning framework. First, it employs a speech-driven attention pooling mechanism to dynamically compress long-context representations, preserving only the domain-specific keywords most relevant to the current acoustic input. Second, it integrates a two-stage dynamic pruning and fusion strategy to jointly optimize noise suppression and contextual enhancement. Evaluated on SlideSpeech and LibriSpeech, SAP² achieves WERs of 7.71% and 1.12%, respectively, and reduces the biased word error rate (B-WER) by 41.1% on SlideSpeech relative to a context-free baseline. Moreover, it scales well to long contexts, establishing an efficient and robust paradigm for contextualized ASR.
📝 Abstract
Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily from constrained model context windows and the sparsity of relevant information within extensive contextual noise. To address this, we propose SAP², a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP² on the SlideSpeech and LibriSpeech datasets, with word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces the biased word error rate (B-WER) by 41.1% compared to a non-contextual baseline. SAP² also exhibits robust scalability, consistently maintaining performance under extensive contextual input on both datasets.
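To make the speech-driven pooling-and-pruning idea concrete, here is a minimal NumPy sketch of one plausible reading: context keyword embeddings are scored by cross-attention against the acoustic frames, low-relevance keywords are pruned, and the survivors are pooled by their attention mass. This is not the authors' implementation; the dimensions, the mean-over-frames salience score, and the fixed top-k pruning rule are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def speech_driven_pool_and_prune(speech, keywords, top_k):
    """Compress a long keyword context conditioned on the speech.

    speech:   (T, d) acoustic frame embeddings (query side)
    keywords: (N, d) context keyword embeddings (key/value side)
    Returns (keep, pooled): indices of the surviving keywords and a
    (top_k, d) relevance-weighted compression of their embeddings.
    """
    d = speech.shape[-1]
    # Cross-attention scores: how strongly each frame attends to each keyword.
    scores = speech @ keywords.T / np.sqrt(d)      # (T, N)
    attn = softmax(scores, axis=-1)                # rows sum to 1
    # Per-keyword salience: average attention mass over all frames
    # (an illustrative choice, not the paper's exact scoring rule).
    relevance = attn.mean(axis=0)                  # (N,)
    # Prune: keep only the top_k most speech-relevant keywords.
    keep = np.argsort(relevance)[::-1][:top_k]
    # Pool: re-weight the survivors by their renormalized relevance.
    w = relevance[keep] / relevance[keep].sum()
    pooled = w[:, None] * keywords[keep]           # (top_k, d)
    return keep, pooled

rng = np.random.default_rng(0)
speech = rng.normal(size=(50, 16))     # 50 frames, hypothetical dim 16
keywords = rng.normal(size=(200, 16))  # 200 candidate context keywords
keep, pooled = speech_driven_pool_and_prune(speech, keywords, top_k=8)
print(keep.shape, pooled.shape)
```

In the paper's two-stage variant, a step like this would run twice: a first coarse pass to suppress contextual noise, then a second pass to fuse the retained keywords with the recognizer's state.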