🤖 AI Summary
To address the reliance on hyperparameter tuning in unsupervised clustering for speaker diarization, as well as the high computational cost of existing online end-to-end approaches, this paper proposes the first hyperparameter-free, end-to-end online neural clustering framework. The system processes streaming speech via an RNN-based chunk-stitching mechanism that models inter-chunk temporal dependencies, and introduces an adaptive centroid refinement decoder that replaces conventional unsupervised clustering modules, enabling real-time, differentiable online clustering without overlapping chunks. Built on the EEND-EDA architecture, it performs chunk-wise online inference, substantially reducing computational complexity. Evaluated on the two-speaker CallHome dataset, the method is competitive with the state of the art while achieving a favorable trade-off between diarization error rate (DER) and inference efficiency.
📝 Abstract
We introduce O-EENC-SD: an end-to-end online speaker diarization system based on EEND-EDA, featuring a novel RNN-based stitching mechanism for online prediction. In particular, we develop a novel centroid refinement decoder whose usefulness is assessed through a rigorous ablation study. Our system provides key advantages over existing methods: it is hyperparameter-free, unlike unsupervised clustering approaches, and more efficient than current online end-to-end methods, which are computationally costly. We demonstrate that O-EENC-SD is competitive with the state of the art on two-speaker conversational telephone speech, as tested on the CallHome dataset. Our results show that O-EENC-SD offers a strong trade-off between DER and computational complexity, even when operating on independent, non-overlapping chunks, making the system highly efficient.
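To make the two core ideas concrete, here is a minimal, illustrative sketch (not the authors' code) of chunk-wise online clustering: a toy recurrent state is carried across non-overlapping chunks (standing in for the RNN-based stitching mechanism), and speaker centroids are refined online as frames stream in (standing in for the learned centroid refinement decoder). All function names, the GRU-style update, and the EMA refinement rule are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_SPK = 8, 2  # embedding size; two-speaker setting as in CallHome

def stitch_state(h_prev, chunk_summary, W, U):
    """Toy recurrent update: carry inter-chunk temporal context forward."""
    return np.tanh(W @ h_prev + U @ chunk_summary)

def refine_centroids(centroids, frames, momentum=0.9):
    """Assign each frame to its nearest centroid, then nudge that centroid
    toward the frame with an exponential moving average (a simple stand-in
    for a learned, differentiable centroid refinement step)."""
    for x in frames:
        k = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        centroids[k] = momentum * centroids[k] + (1 - momentum) * x
    return centroids

# Two well-separated synthetic "speakers" streamed in non-overlapping chunks.
means = np.stack([np.full(DIM, -2.0), np.full(DIM, 2.0)])
W = 0.1 * rng.standard_normal((DIM, DIM))
U = 0.1 * rng.standard_normal((DIM, DIM))
h = np.zeros(DIM)
centroids = rng.standard_normal((N_SPK, DIM))

for _ in range(20):  # each iteration = one incoming chunk, processed once
    labels = rng.integers(0, N_SPK, size=10)
    frames = means[labels] + 0.1 * rng.standard_normal((10, DIM))
    centroids = refine_centroids(centroids, frames)
    h = stitch_state(h, frames.mean(axis=0), W, U)  # state for the next chunk

print(np.round(np.sort(centroids.mean(axis=1)), 1))
```

Because each chunk is processed exactly once and only the small recurrent state is retained, cost grows linearly with stream length, which is the efficiency property the abstract highlights for non-overlapping chunk-wise inference.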