KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting

📅 2026-06-08

🤖 AI Summary

This work addresses the challenge of distinguishing target keywords from phonetically confusable alternatives in user-defined keyword spotting. The authors propose a multimodal detection framework that uses the peaky posterior distributions of CTC to select high-confidence phoneme keyframes, which enables precise alignment across audio, phoneme, and text modalities. Cross-attention then fuses these keyframes with full-utterance representations, combining the local discriminative cues of the keyframes with the global context of the whole utterance. On the LibriPhrase dataset, the model achieves the best-balanced performance with 98.73% AUC, and reaches 97.65% AUC and 7.75% EER on the hard subset, substantially outperforming advanced baselines.
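The keyframe selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_keyframes`, the blank index `0`, and the confidence threshold `0.5` are all assumptions made for illustration. The sketch takes per-frame CTC posteriors, keeps frames whose most likely label is a confident non-blank symbol, and collapses each run of the same label to its single highest-probability frame.

```python
def select_keyframes(posteriors, blank=0, threshold=0.5):
    """Pick one high-confidence frame per non-blank CTC peak.

    posteriors: list of per-frame probability lists (T x V).
    blank and threshold are illustrative assumptions, not values from the paper.
    Returns a list of (frame_index, label) pairs.
    """
    keyframes = []
    best = None  # (probability, frame_index, label) of the current run
    for t, frame in enumerate(posteriors):
        label = max(range(len(frame)), key=frame.__getitem__)
        prob = frame[label]
        if label == blank or prob < threshold:
            # A blank or low-confidence frame closes the current run.
            if best is not None:
                keyframes.append((best[1], best[2]))
                best = None
            continue
        if best is not None and label == best[2]:
            # Same label as the running peak: keep the more confident frame.
            if prob > best[0]:
                best = (prob, t, label)
        else:
            # A new label starts: emit the previous peak, if any.
            if best is not None:
                keyframes.append((best[1], best[2]))
            best = (prob, t, label)
    if best is not None:
        keyframes.append((best[1], best[2]))
    return keyframes


# Toy posteriors over {0: blank, 1: phoneme A, 2: phoneme B}.
posteriors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.90, 0.05],
    [0.95, 0.03, 0.02],
    [0.30, 0.30, 0.40],
    [0.10, 0.10, 0.80],
]
print(select_keyframes(posteriors))  # [(2, 1), (5, 2)]
```

In this toy input, frames 1 and 2 both peak on label 1 and only the more confident frame 2 is kept, while frame 4 is dropped because its peak falls below the assumed threshold.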

📝 Abstract

User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
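The fusion step can be sketched as single-head scaled dot-product cross-attention. The abstract does not state which side acts as the query, so this sketch assumes the keyframes query the full-utterance frames; it also omits learned projections, multiple heads, and the text and phoneme branches. The names `softmax` and `cross_attention` are hypothetical helpers, not part of KFC-KWS.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with no learned projections.

    queries: keyframe embeddings (assumed to be the query side).
    keys, values: full-utterance frame embeddings.
    Returns one context vector per query.
    """
    dim = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


# Toy utterance of three 2-dimensional frames; frames 0 and 2 are keyframes.
utterance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keyframes = [utterance[0], utterance[2]]
fused = cross_attention(keyframes, utterance, utterance)
```

Each row of `fused` is a weighted average of all utterance frames, so every keyframe representation is enriched with global context while remaining anchored to its own local evidence.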

Problem

Research questions and friction points this paper is trying to address.

keyword spotting

user-defined

phonetic confusion

multimodal alignment

confusable keywords

Innovation

Methods, ideas, or system contributions that make the work stand out.

keyframe fusion

CTC-guided alignment

user-defined keyword spotting