🤖 AI Summary
Most existing stuttering detection methods classify dysfluencies only at the utterance level, which limits their usefulness for targeted speech therapy and real-time feedback. This work proposes StutterCut, a semi-supervised framework that formulates dysfluency segmentation as a graph partitioning problem: speech embeddings from overlapping windows serve as graph nodes, and the graph is partitioned with Normalized Cut. The connections between nodes are refined by a pseudo-oracle classifier trained only on utterance-level (weak) labels, with Monte Carlo dropout providing an uncertainty measure that controls how strongly the classifier influences the graph. The authors also extend the weakly labelled FluencyBank dataset with frame-level boundary annotations for four dysfluency types, giving a more realistic benchmark than synthetic data. Experiments on real and synthetic datasets show that StutterCut outperforms existing methods, with higher F1 scores and more precise stuttering onset detection.
📝 Abstract
Detecting and segmenting dysfluencies is crucial for effective speech therapy and real-time feedback. However, most methods only classify dysfluencies at the utterance level. We introduce StutterCut, a semi-supervised framework that formulates dysfluency segmentation as a graph partitioning problem, where speech embeddings from overlapping windows are represented as graph nodes. We refine the connections between nodes using a pseudo-oracle classifier trained on weak (utterance-level) labels, with its influence controlled by an uncertainty measure from Monte Carlo dropout. Additionally, we extend the weakly labelled FluencyBank dataset by incorporating frame-level dysfluency boundaries for four dysfluency types. This provides a more realistic benchmark compared to synthetic datasets. Experiments on real and synthetic datasets show that StutterCut outperforms existing methods, achieving higher F1 scores and more precise stuttering onset detection.
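The pipeline described above (window embeddings as graph nodes, connections refined by an uncertainty-weighted pseudo-oracle, then a graph partition) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The cosine affinity, the linear blending rule in `refine_affinity`, the normalisation of the Monte Carlo dropout spread into a single uncertainty value, and all function names are assumptions made for illustration. The partition step uses the standard spectral relaxation of a two-way Normalized Cut, and the stochastic forward passes of the classifier are simulated with random numbers instead of a real model.

```python
import numpy as np


def cosine_affinity(emb):
    """Non-negative cosine similarity between window embeddings of shape (n, d)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.clip(unit @ unit.T, 0.0, None)


def mc_dropout_stats(prob_samples):
    """Summarise T stochastic passes; prob_samples has shape (T, n).

    Returns the mean dysfluency probability per window and one scalar
    uncertainty in [0, 1] (assumed definition: mean std divided by 0.5,
    the largest std a value confined to [0, 1] can have).
    """
    mean_prob = prob_samples.mean(axis=0)
    uncertainty = float(np.clip(prob_samples.std(axis=0).mean() / 0.5, 0.0, 1.0))
    return mean_prob, uncertainty


def refine_affinity(W, mean_prob, uncertainty):
    """Blend embedding affinity with pseudo-oracle agreement (assumed rule).

    The less certain the pseudo-oracle is, the less it changes the graph.
    """
    agreement = 1.0 - np.abs(mean_prob[:, None] - mean_prob[None, :])
    alpha = 1.0 - uncertainty
    return (1.0 - alpha) * W + alpha * agreement


def normalized_cut_bipartition(W):
    """Two-way Normalized Cut via the symmetric normalised Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)          # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]        # second-smallest eigenvector
    return fiedler > 0


def segment_windows(emb, prob_samples):
    """Return a boolean mask marking windows assigned to the dysfluent side."""
    mean_prob, uncertainty = mc_dropout_stats(prob_samples)
    W = refine_affinity(cosine_affinity(emb), mean_prob, uncertainty)
    side = normalized_cut_bipartition(W)
    if side.all() or not side.any():
        return side
    # The eigenvector sign is arbitrary, so orient with the pseudo-oracle.
    if mean_prob[side].mean() < mean_prob[~side].mean():
        side = ~side
    return side


# Toy utterance: 12 overlapping windows, windows 4..7 are dysfluent.
rng = np.random.default_rng(0)
truth = np.array([False] * 4 + [True] * 4 + [False] * 4)
base = np.where(truth[:, None], [0.0, 1.0], [1.0, 0.0])
emb = base + 0.05 * rng.normal(size=base.shape)
# Simulated output of 20 dropout-enabled passes of a hypothetical classifier.
prob_samples = np.clip(
    np.where(truth, 0.9, 0.1) + 0.02 * rng.normal(size=(20, 12)), 0.0, 1.0
)
mask = segment_windows(emb, prob_samples)
```

In this sketch, `mask` marks which windows fall on the dysfluent side of the cut. Contiguous runs of `True` would then be mapped back to time boundaries using the window length and hop size, which are not specified here.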