🤖 AI Summary
This study addresses a limitation of a common evaluation metric for unsupervised word discovery, normalized edit distance: it is biased toward the quality of large clusters and ignores how true word types are distributed across clusters, which can distort assessments of lexicon quality. To remedy this, the work proposes two metrics grounded in established clustering theory. The first is a modified metric that weights cluster size when assessing within-cluster consistency; the second is an inverse metric that assesses how ground-truth words are spread across clusters. Experiments on synthetic and real-world lexicons show that, used together, these metrics correlate more closely with how similar a lexicon is to the ground-truth distribution and are more robust to biases that skew lexicon evaluations.
📝 Abstract
Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in the clustering literature, we propose two metrics that address these shortcomings: a modified metric that weights cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that, combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.
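The large-cluster bias can be illustrated with a small sketch. It assumes that conventional normalized edit distance is computed by pooling every within-cluster pair into one average, so a cluster of n units contributes n(n-1)/2 pairs. The function names, the toy clusters, and the size-weighted variant are all illustrative assumptions; the variant is not necessarily the exact formulation proposed in the paper.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def ned(a, b):
    """Edit distance normalized by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def pair_pooled_ned(clusters):
    """Assumed conventional form: average over all within-cluster pairs."""
    dists = [ned(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(dists) / len(dists)


def size_weighted_ned(clusters):
    """Hypothetical variant: per-cluster mean NED, weighted by cluster size."""
    total, n_units = 0.0, 0
    for c in clusters:
        pairs = [ned(a, b) for a, b in combinations(c, 2)]
        if not pairs:  # singleton clusters have no pairs to compare
            continue
        total += len(c) * sum(pairs) / len(pairs)
        n_units += len(c)
    return total / n_units


# Toy lexicon: one large, perfectly consistent cluster and one small, bad one.
clusters = [
    [("k", "ae", "t")] * 6,                  # 15 pairs, all with NED 0
    [("d", "o", "g"), ("s", "u", "n")],      # 1 pair with NED 1
]

print(pair_pooled_ned(clusters))
print(size_weighted_ned(clusters))
```

In this toy case the pooled score is dominated by the 15 pairs from the large cluster, so the poor small cluster barely registers, while weighting by cluster size (linear rather than quadratic in the number of units) lets it count for more. Neither function addresses the second shortcoming, how true words are spread across clusters, which the inverse metric is meant to capture.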