🤖 AI Summary
Large-scale entity resolution calls for partition priors with the microclustering property, under which the number of clusters grows linearly with the sample size while the size of the largest cluster grows sublinearly, yet existing Bayesian approaches with this property are computationally expensive at scale. This paper introduces the microclustering Ewens–Pitman model, obtained by scaling the strength (concentration) parameter of the Ewens–Pitman random partition linearly with the sample size, and proves that the resulting partition has the microclustering property. Exploiting the connection between the Ewens–Pitman random partition and the Pitman–Yor process, the authors develop efficient variational inference schemes for posterior computation in entity resolution. The resulting method maintains accuracy comparable to state-of-the-art Bayesian entity resolution while accelerating inference by roughly three orders of magnitude, making Bayesian entity resolution practical at much larger scales.
📝 Abstract
We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens--Pitman random partition and the Pitman--Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.
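As a concrete illustration of the construction (not code from the paper), the sketch below samples a partition from the two-parameter Ewens--Pitman model via its standard sequential, Chinese-restaurant-style construction and sets the strength parameter to θ = λn for a hypothetical scaling rate λ, mirroring the linear scaling described in the abstract. The function name and parameter values are illustrative assumptions; empirically, the number of clusters grows roughly linearly in n while the largest cluster stays small, which is the microclustering behaviour the paper establishes.

```python
import numpy as np

def sample_ewens_pitman_partition(n, alpha, theta, rng):
    """Sample a partition of n items from the two-parameter Ewens--Pitman
    (Pitman--Yor) model via its sequential construction: after i items,
    the next item joins an existing cluster of size n_j with probability
    (n_j - alpha) / (i + theta), or starts a new cluster with probability
    (theta + k * alpha) / (i + theta), where k is the current number of clusters."""
    sizes = np.zeros(n)  # cluster sizes; at most n clusters
    k = 0
    for i in range(n):
        probs = np.empty(k + 1)
        probs[:k] = sizes[:k] - alpha       # weights of existing clusters
        probs[k] = theta + k * alpha        # weight of opening a new cluster
        j = rng.choice(k + 1, p=probs / (i + theta))
        if j == k:
            k += 1
        sizes[j] += 1
    return sizes[:k]

rng = np.random.default_rng(0)
alpha, lam = 0.5, 1.0  # hypothetical discount and linear scaling rate (illustrative values)
for n in (1_000, 4_000, 16_000):
    sizes = sample_ewens_pitman_partition(n, alpha=alpha, theta=lam * n, rng=rng)
    # clusters/n stays roughly constant, while the largest cluster grows slowly
    print(f"n={n:>6}  clusters/n={len(sizes) / n:.3f}  max cluster={int(sizes.max())}")
```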