🤖 AI Summary
This work addresses the limitations of purely speech-based language models, which are often hindered by the lengthy discrete token sequences produced by self-supervised encoders, and proposes ZeroSyl, an entirely training-free method for constructing syllable-level representations. ZeroSyl leverages the L2 norm of intermediate features from a frozen WavLM model to detect syllable boundaries, followed by mean pooling and K-means clustering to generate syllable embeddings for downstream speech language modeling. By eliminating multi-stage training pipelines, this approach achieves, for the first time, fully training-free syllable segmentation and embedding extraction. Experimental results demonstrate that ZeroSyl outperforms existing syllable-based tokenization methods across lexical, syntactic, and narrative benchmarks, while also exhibiting superior scalability in syntactic modeling.
📝 Abstract
Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
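The tokenization pipeline described above (per-frame L2 norms → boundary detection → mean pooling → K-means discretization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the local-minimum boundary rule, the `min_gap` parameter, and the random stand-in for WavLM frame features are all illustrative assumptions; in the actual method the features come from an intermediate layer of a frozen WavLM model, and the pooled embeddings would then be discretized with K-means (e.g. `sklearn.cluster.KMeans`).

```python
import numpy as np

def detect_boundaries(features, min_gap=2):
    """Place candidate syllable boundaries at strict local minima of the
    per-frame feature L2 norm (an illustrative heuristic; the paper's
    exact boundary rule may differ)."""
    norms = np.linalg.norm(features, axis=1)  # (T,) one norm per frame
    minima = [t for t in range(1, len(norms) - 1)
              if norms[t] < norms[t - 1] and norms[t] < norms[t + 1]]
    # Enforce a minimum spacing between successive boundaries.
    boundaries = [0]
    for t in minima:
        if t - boundaries[-1] >= min_gap:
            boundaries.append(t)
    boundaries.append(len(norms))
    return boundaries

def pool_segments(features, boundaries):
    """Mean-pool frame features within each detected segment, yielding
    one embedding per syllable-like unit (later discretized by K-means)."""
    return np.stack([features[a:b].mean(axis=0)
                     for a, b in zip(boundaries[:-1], boundaries[1:])])

# Toy demo with random stand-ins for WavLM frames (T frames, D dims).
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))
bounds = detect_boundaries(feats)
segs = pool_segments(feats, bounds)
print(len(bounds) - 1, segs.shape)  # number of units, (units, D)
```

The resulting segment embeddings form a much shorter sequence than the original frame-level tokens, which is the point of syllable-level units for downstream language modeling.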