CLAR: CIF-Localized Alignment for Retrieval-Augmented Speech LLM-Based Contextual ASR

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a weakness of speech large language models (Speech LLMs): strong language-model priors make it hard to localize hotwords and long-tail named entities accurately under weak supervision. It proposes CLAR, a dual-encoder speech-text retrieval framework that uses the Continuous Integrate-and-Fire (CIF) mechanism to learn monotonic token-level alignments without timestamp supervision. CLAR adds length-aware localized matching to anchor the acoustic cues of short entities. Training combines multi-granularity contrastive learning with a CIF quantity constraint, which mitigates representation dilution and attention drift. Experiments show that CLAR significantly improves hotword retrieval accuracy and reduces both character error rate (CER) and biased word error rate (B-WER) relative to strong contextual ASR baselines.

📝 Abstract
Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurate hotword localization in full-utterance speech under weak supervision. We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. With length-aware localized matching, CLAR anchors short-entity acoustic cues and reduces representation dilution and attention drift. The retriever is trained with a multi-granularity objective combining global and local segment-level contrastive losses and a CIF quantity constraint. At inference, top-ranked hotwords are injected as contextual prompts for the Speech LLM, improving recognition without shallow fusion. Experiments show that CLAR significantly improves hotword retrieval and reduces both CER and B-WER against strong contextual ASR baselines.
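The CIF step named in the abstract can be illustrated with a minimal sketch. This follows the generic CIF recipe (accumulate per-frame weights, emit a token vector whenever the accumulated weight reaches a threshold) and is not necessarily the exact variant used in CLAR. The function names `cif_fire` and `quantity_loss`, the threshold of 1.0, and the assumption that every weight is below the threshold are all illustrative choices, not details taken from the paper.

```python
def cif_fire(alphas, frames, threshold=1.0):
    """Generic CIF: integrate per-frame weights and fire one token vector
    each time the accumulated weight reaches `threshold`.

    Assumes every alpha is in [0, threshold), so a frame fires at most once.
    """
    dim = len(frames[0])
    tokens = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    for alpha, frame in zip(alphas, frames):
        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating this frame.
            acc_weight += alpha
            acc_vec = [a + alpha * x for a, x in zip(acc_vec, frame)]
        else:
            # Fire: spend only the weight needed to reach the threshold...
            used = threshold - acc_weight
            tokens.append([a + used * x for a, x in zip(acc_vec, frame)])
            # ...and carry the remainder of this frame into the next token.
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * x for x in frame]
    return tokens


def quantity_loss(alphas, target_len):
    """Quantity constraint: the summed weights should match the token count."""
    return abs(sum(alphas) - target_len)


# Toy input: four frames, two of each "kind", weights summing to 2 tokens.
alphas = [0.5, 0.5, 0.25, 0.75]
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
tokens = cif_fire(alphas, frames)
print(len(tokens), quantity_loss(alphas, target_len=2))
```

Because firing only ever moves forward through the frames, the resulting token vectors are monotonically ordered in time, which is why no timestamp labels are needed to train the alignment; the quantity term only needs the target token count.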
Problem

Research questions and friction points this paper is trying to address.

speech LLM
contextual ASR
hotword localization
named entity recognition
retrieval-augmented ASR
Innovation

Methods, ideas, or system contributions that make the work stand out.

CIF-based alignment
retrieval-augmented ASR
monotonic token alignment
length-aware localized matching
speech-text dual encoder
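
The retrieval-then-prompt inference flow described in the abstract might look roughly like the sketch below. Everything here is an assumption for illustration: the embeddings are toy vectors rather than dual-encoder outputs, scoring a hotword by its best-matching speech segment is one plausible reading of "localized matching", and the helper names (`cosine`, `retrieve_hotwords`, `build_prompt`) and the prompt wording are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_hotwords(segment_vecs, hotword_vecs, top_k=2):
    """Rank hotwords by their best-matching speech segment and keep the top_k.

    segment_vecs: list of speech-side segment embeddings (assumed given).
    hotword_vecs: dict mapping hotword text to its text-side embedding.
    """
    scores = {}
    for word, vec in hotword_vecs.items():
        sims = [cosine(seg, vec) for seg in segment_vecs]
        scores[word] = max(sims, default=0.0)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def build_prompt(hotwords):
    """Hypothetical prompt template; the paper's actual prompt is not given."""
    return "Possible hotwords: " + ", ".join(hotwords) + ". Transcribe the audio."


segments = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "Zhangjiajie": [0.9, 0.1],
    "Kunming": [0.1, 0.8],
    "Lhasa": [-1.0, -1.0],
}
top = retrieve_hotwords(segments, candidates, top_k=2)
print(build_prompt(top))
```

Scoring against individual segments instead of one pooled utterance vector is meant to mirror the motivation stated above: a short entity occupies only a few frames, so pooling over the whole utterance would dilute its contribution.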
Shangkun Huang
BRVoice Team, Bairong, Inc., China
Huan Shen
BRVoice Team, Bairong, Inc., China
Wei Zou
PKU, Samsung, Baidu, Didi, Ke
Speech, NLP, LLM, Multimodal
Yunzhang Chen
BRVoice Team, Bairong, Inc., China