Recovering the Zipfian Distribution in Unsupervised Term Discovery

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of the centroid-based clustering methods that dominate unsupervised term discovery: their inductive bias toward spherical clusters yields more uniform cluster sizes, which impedes recovery of a vocabulary following the Zipfian distribution observed in natural language. To overcome this, the authors revisit a bottom-up graph clustering approach that builds a graph from pairwise similarities between speech segment embeddings and partitions it with the Leiden algorithm, recovering word- or syllable-level lexicons that align better with Zipf’s law. Graph clustering substantially outperforms the centroid-based baselines (K-means, Gaussian Mixture Models, and BIRCH) across three languages and produces more Zipf-like distributions. Average-linkage agglomerative clustering also performs well, but it is computationally less efficient and gives less control over the resulting distribution. By questioning the dominance of centroid-based clustering, the work promotes graph clustering as an alternative for unsupervised term discovery that more faithfully captures the statistical properties of language.
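To make "aligned with Zipf’s law" concrete, the sketch below fits a straight line to the rank-frequency curve of cluster sizes in log-log space. It is only an illustration: the summary does not say how the paper measures Zipf-likeness, so the function `zipf_slope`, the least-squares fit, and the toy labels are all assumptions. A slope near -1 suggests a Zipf-like lexicon, while a slope near 0 indicates roughly uniform cluster sizes.

```python
import math
from collections import Counter


def zipf_slope(cluster_labels):
    """Least-squares slope of log(cluster size) against log(rank).

    Hypothetical helper for illustration; not the paper's metric.
    """
    # Cluster sizes sorted from largest to smallest form the rank-frequency curve.
    sizes = sorted(Counter(cluster_labels).values(), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy assignments (made-up data): cluster r holds 120 / r segments, so size
# is inversely proportional to rank.
zipf_like = [r for r in range(1, 7) for _ in range(120 // r)]
# Toy assignments where every cluster holds the same number of segments.
uniform = [r for r in range(1, 7) for _ in range(20)]

print(zipf_slope(zipf_like))  # close to -1
print(zipf_slope(uniform))    # close to 0
```

On real cluster assignments the fit is noisier, particularly in the long tail of singleton clusters, so the slope is best read as a rough indicator rather than a precise statistic.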

📝 Abstract

Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
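The graph construction step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract says only that segment embeddings are connected by pairwise similarity, so the cosine measure, the fixed `threshold`, the function names, and the toy two-dimensional embeddings are all hypothetical choices rather than the authors' configuration. The Leiden partitioning step itself is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(embeddings, threshold=0.8):
    """Return weighted edges (i, j, similarity) for pairs at or above threshold.

    Assumption: a simple thresholded cosine graph; the paper may build
    its graph differently.
    """
    edges = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            if sim >= threshold:
                edges.append((i, j, sim))
    return edges


# Toy segment embeddings (made-up data): two pairs pointing in similar directions.
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
edges = similarity_graph(segments)
print([(i, j) for i, j, _ in edges])  # the pairs (0, 1) and (2, 3) are connected
```

The resulting weighted edge list is the kind of input a Leiden implementation would partition into clusters; the `leidenalg` package used together with `python-igraph` is one commonly used option. The all-pairs loop is quadratic in the number of segments, so a nearest-neighbour search would be a more practical choice at scale.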

Problem

Research questions and friction points this paper is trying to address.

unsupervised term discovery

Zipfian distribution

clustering bias

lexical distribution

speech segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-based clustering

Zipfian distribution

unsupervised term discovery