🤖 AI Summary
This study addresses the problem of improving the accuracy of discrete distribution estimation by leveraging semantic and other auxiliary information. It presents the first systematic modeling and theoretical quantification of how such auxiliary information reduces squared-error risk under two settings: a local model, where the unknown distribution is close to a known prior, and a partial-order model, where the alphabet is partitioned into high- and low-probability subsets. The proposed framework integrates statistical learning theory, risk analysis, and semantic representations, such as word embeddings, to construct distributional neighborhoods and partial-order structures. Experiments on both natural language and synthetic data demonstrate that the approach significantly lowers estimation error, with empirical results closely aligning with theoretical predictions.
📝 Abstract
We consider the classical problem of discrete distribution estimation from i.i.d. samples in a novel scenario where additional side information about the distribution is available. In large-alphabet datasets such as text corpora, such side information arises naturally through word semantics and similarities, which can be inferred, for instance, from the closeness of vector word embeddings. We consider two specific models for the side information: a local model, where the unknown distribution lies in the neighborhood of a known distribution, and a partial-ordering model, where the alphabet is partitioned into known higher- and lower-probability sets. In both models, we theoretically characterize the improvement in a suitable squared-error risk due to the available side information. Simulations on natural language and synthetic data illustrate these gains.
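To make the local model concrete, here is a minimal, hypothetical sketch (not the paper's actual estimator): if the unknown distribution p is assumed to lie in an L2 ball of radius delta around a known prior q, projecting the empirical estimate onto that ball can never increase its L2 error, because projection onto a convex set is non-expansive toward any point inside the set. The alphabet size, sample size, radius, and the way p perturbs q below are all illustrative choices.

```python
import math
import random

def l2(a, b):
    """Euclidean distance between two probability vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def project_to_ball(phat, q, delta):
    """Project phat onto the L2 ball of radius delta centred at q.
    The result still sums to 1 and stays non-negative, since each
    coordinate only moves between phat_i and q_i."""
    d = l2(phat, q)
    if d <= delta:
        return list(phat)
    s = delta / d
    return [qi + s * (pi - qi) for pi, qi in zip(phat, q)]

random.seed(0)
k, n, delta = 20, 200, 0.05          # illustrative alphabet size, samples, radius
q = [1.0 / k] * k                    # known prior (the assumed side information)
# True p: a small alternating perturbation of q, chosen to stay in the ball.
p = [qi + (0.04 / k) * (1 if i % 2 else -1) for i, qi in enumerate(q)]
assert l2(p, q) <= delta             # side-information assumption holds

samples = random.choices(range(k), weights=p, k=n)
phat = [samples.count(i) / n for i in range(k)]   # plain empirical estimate
ptil = project_to_ball(phat, q, delta)            # side-information estimate

# Non-expansiveness of convex projection: ptil is never farther from p.
assert l2(ptil, p) <= l2(phat, p)
print("empirical error:", l2(phat, p))
print("projected error:", l2(ptil, p))
```

Note the guarantee here is per-sample and deterministic for the L2 distance; the paper's results concern the squared-error risk (the expectation of this quantity), where the same projection argument gives the improvement in expectation.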