AI Summary
End-to-end ASR models perform well on general speech transcription but exhibit insufficient accuracy for context-sensitive proper nouns and user-specific entities. To address this, we propose a keyword-aware, multi-granularity late-fusion framework that jointly models token-level and phrase-level semantic alignment, tightly integrating ASR outputs with the contextual reasoning capabilities of large language models (LLMs). Our approach avoids costly end-to-end fine-tuning by employing dual-path information fusion, enabling fine-grained keyword correction while preserving global semantic consistency. Evaluated on multilingual, multi-scenario datasets, it achieves significant improvements in keyword F1 score (+4.2 to 6.8 points), with negligible change in overall word error rate (WER) on non-keyword tokens. To our knowledge, this is the first method to achieve joint optimization of keyword awareness and general ASR performance without modifying the underlying ASR architecture.
Abstract
While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities.
Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases.
However, these methods operate at different granularities, each with its own limitations.
In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs).
Our approach incorporates a late-fusion strategy that elegantly combines ASR's acoustic information with LLM's rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding.
Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text.
Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework.
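As a rough illustration only (not the paper's actual formulation), a multi-grained late-fusion score of this kind can be sketched as a token-level interpolation of ASR and LLM log-probabilities plus a phrase-level bonus for hypotheses that copy a keyword phrase intact. The names `fuse`, the weights `alpha` and `bonus`, and the substring-based keyword check below are all hypothetical choices for the sketch:

```python
def fuse(asr_logp, llm_logp, hypothesis, keywords, alpha=0.5, bonus=2.0):
    """Hypothetical multi-grained late-fusion score for one hypothesis.

    asr_logp / llm_logp: log-probabilities of the hypothesis under the
    ASR model (acoustic evidence) and the LLM (contextual evidence).
    keywords: user-supplied keyword phrases to bias toward.
    """
    # Token-level fusion: interpolate acoustic and contextual log-probs.
    score = (1 - alpha) * asr_logp + alpha * llm_logp
    # Phrase-level fusion: reward hypotheses containing a full keyword phrase.
    if any(kw in hypothesis for kw in keywords):
        score += bonus
    return score

# A hypothesis containing the keyword "Zhang Wei" outscores one that drops it,
# even when its raw model scores are identical.
with_kw = fuse(-1.0, -2.0, "call Zhang Wei now", ["Zhang Wei"])
without_kw = fuse(-1.0, -2.0, "call John now", ["Zhang Wei"])
```

In a real decoder this scoring would run inside beam search over partial hypotheses; the sketch only shows how the two granularities combine into a single rescoring term.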
The code and models will be publicly available at https://github.com/.