KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting

📅 2026-06-08
🤖 AI Summary
This work addresses the challenge of distinguishing target keywords from phonetically confusable alternatives in user-defined keyword spotting. The authors propose a multimodal detection framework that uses the peaky posterior distributions of CTC to select high-confidence phoneme keyframes, which in turn enables alignment across the audio, phoneme, and text modalities. A cross-attention mechanism then fuses these keyframes with full-utterance representations, so the model can exploit both the local discriminative cues of the keyframes and the global context of the whole utterance. On the LibriPhrase dataset, the model reports an overall AUC of 98.73%, and 97.65% AUC with 7.75% EER on the hard subset, substantially outperforming the baselines it is compared against.
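The keyframe-selection idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that a keyframe is the most confident frame within each run of consecutive frames sharing the same non-blank argmax label, and that a confidence threshold filters weak peaks. The function name `select_keyframes`, the `blank_id` index, and the `threshold` value are all hypothetical.

```python
import numpy as np


def select_keyframes(posteriors, blank_id=0, threshold=0.5):
    """Pick one high-confidence frame per non-blank CTC peak.

    posteriors: (T, V) array of per-frame CTC posteriors (rows sum to 1).
    Returns a list of (frame_index, label) pairs.
    Assumption: blank_id and threshold are illustrative choices only.
    """
    best = posteriors.argmax(axis=1)   # most likely label per frame
    conf = posteriors.max(axis=1)      # its posterior probability
    num_frames = len(best)
    keyframes = []
    t = 0
    while t < num_frames:
        label = best[t]
        if label == blank_id:
            t += 1
            continue
        # Extend over the run of frames that share this non-blank label.
        end = t
        while end + 1 < num_frames and best[end + 1] == label:
            end += 1
        # Keep the single most confident frame of the run.
        peak = t + int(conf[t:end + 1].argmax())
        if conf[peak] >= threshold:
            keyframes.append((peak, int(label)))
        t = end + 1
    return keyframes


# Toy posteriors: 6 frames, vocabulary {0: blank, 1, 2}.
toy = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.05, 0.90, 0.05],
    [0.80, 0.10, 0.10],
    [0.30, 0.30, 0.40],
    [0.10, 0.10, 0.80],
])
print(select_keyframes(toy))
```

Because CTC posteriors tend to be dominated by blanks with short spikes for each phoneme, a rule of this kind yields roughly one frame per emitted phoneme, which is what makes the selected frames usable as alignment anchors.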
📝 Abstract
User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
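The fusion step described in the abstract can be sketched with single-head scaled dot-product cross-attention. This is a sketch under stated assumptions rather than the paper's architecture: it assumes keyframe features act as queries and full-utterance frames act as keys and values, it omits the learned projection matrices a real model would have, and it fuses by simple concatenation. The names `cross_attention` and `fuse` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(keyframes, utterance):
    """Attend from keyframes (K, d) over utterance frames (T, d).

    Assumption: keyframes are queries; utterance frames are keys and values.
    No learned projections are applied in this sketch.
    """
    d = keyframes.shape[-1]
    scores = keyframes @ utterance.T / np.sqrt(d)   # (K, T)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    context = weights @ utterance                   # (K, d) global context
    return context, weights


def fuse(keyframes, utterance):
    """Concatenate local keyframe features with their attended context."""
    context, _ = cross_attention(keyframes, utterance)
    return np.concatenate([keyframes, context], axis=-1)  # (K, 2d)


rng = np.random.default_rng(0)
keyframe_feats = rng.normal(size=(3, 8))     # 3 keyframes, 8-dim features
utterance_feats = rng.normal(size=(10, 8))   # 10 utterance frames
fused = fuse(keyframe_feats, utterance_feats)
print(fused.shape)
```

Each fused row keeps the keyframe's own local features in its first half and a context vector summarizing the whole utterance in its second half, mirroring the local-plus-global combination the abstract describes.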
Problem

Research questions and friction points this paper is trying to address.

keyword spotting
user-defined
phonetic confusion
multimodal alignment
confusable keywords
Innovation

Methods, ideas, or system contributions that make the work stand out.

keyframe fusion
CTC-guided alignment
user-defined keyword spotting
multimodal fusion
phoneme-level discrimination