🤖 AI Summary
This work addresses the spontaneous emergence of syllable-level structure in unsupervised speech representation learning. We propose a sentence-level self-distillation framework that, without any external labels or multimodal supervision, implicitly induces HuBERT to model syllable boundaries and organization: sentence-level targets are constructed via token aggregation, and frame-level representation analysis is jointly performed with syllable alignment evaluation. To our knowledge, this is the first demonstration of stable, interpretable, and high-accuracy syllable structure emergence from purely self-supervised sentence-level distillation. We introduce a novel evaluation task, Spoken Speech ABX, specifically designed for assessing spoken-word-level discriminability and boundary fidelity. Experiments show substantial improvements in unsupervised syllable discovery: our method outperforms prior models on the Spoken Speech ABX benchmark and achieves state-of-the-art alignment accuracy between learned representation boundaries and ground-truth syllable boundaries.
📝 Abstract
Data-driven unit discovery in self-supervised learning (SSL) of speech has embarked on a new era of spoken language processing. Yet, the discovered units often remain in phonetic space and speech units beyond phonemes are largely underexplored. Here, we demonstrate that a syllabic organization emerges in learning sentence-level representation of speech. In particular, we adopt "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames exhibit salient syllabic structures. We demonstrate that this emergent structure largely corresponds to the ground truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech. When compared to previous models, our model outperforms in both unsupervised syllable discovery and learning sentence-level representation. Together, we demonstrate that the self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and potentially provides novel data-driven units for spoken language modeling.