Improving Contextual ASR via Multi-grained Fusion with Large Language Models

๐Ÿ“… 2025-07-16
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
End-to-end ASR models perform well on general speech transcription but exhibit insufficient accuracy for context-sensitive proper nouns and user-specific entities. To address this, we propose a keyword-aware, multi-granularity late-fusion framework that jointly models token-level and phrase-level semantic alignment, tightly integrating ASR outputs with the contextual reasoning capabilities of large language models (LLMs). Our approach avoids costly end-to-end fine-tuning by employing dual-path information fusionโ€”enabling fine-grained keyword correction while preserving global semantic consistency. Evaluated on multilingual, multi-scenario datasets, it achieves significant improvements in keyword F1 score (+4.2โ€“6.8 points), with negligible change in overall word error rate (WER) on non-keyword tokens. To our knowledge, this is the first method to achieve joint optimization of keyword awareness and general ASR performance without modifying the underlying ASR architecture.

Technology Category

Application Category

๐Ÿ“ Abstract
While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR's acoustic information with LLM's rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework. The code and models will be publicly available at https://github.com/.
Problem

Research questions and friction points this paper is trying to address.

Improving contextual keyword recognition in ASR systems
Combining token-level and phrase-level fusion with LLMs
Balancing acoustic and contextual information for better accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-grained fusion with LLMs
Late-fusion of acoustic and contextual info
Joint token and phrase-level fusion
๐Ÿ”Ž Similar Papers
No similar papers found.
Shilin Zhou
Shilin Zhou
School of Computer Science and Technology, Soochow University
Machine LearningNatural Language Processing
Z
Zhenghua Li
School of Computer Science and Technology, Soochow University, Suzhou, China