Large-scale entity resolution via microclustering Ewens--Pitman random partitions

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale entity resolution calls for random partitions with the microclustering property: the number of clusters grows linearly with the sample size, while the size of the largest cluster grows sub-linearly. This paper introduces the microclustering Ewens–Pitman model, which scales the strength parameter of the Ewens–Pitman random partition linearly with the sample size and is shown to have the microclustering property. Exploiting the connection between the Ewens–Pitman random partition and the Pitman–Yor process, the authors develop efficient variational inference schemes for posterior computation in entity resolution. The resulting approach is three orders of magnitude faster than existing Bayesian entity resolution methods while maintaining competitive empirical performance.

📝 Abstract
We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens--Pitman random partition and the Pitman--Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.
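As a rough illustration of the scaling idea in the abstract, the sketch below simulates an Ewens--Pitman partition with the standard sequential (Chinese-restaurant-style) rule and compares a fixed strength parameter with one scaled linearly in the sample size. It is not the paper's inference scheme. The names `alpha` (discount), `theta` (strength), `lam` (the proportionality constant in `theta = lam * n`) and the specific values used are assumptions chosen for this example only.

```python
import random


def sample_ewens_pitman_sizes(n, alpha, theta, seed=0):
    """Sample cluster sizes of an Ewens--Pitman(alpha, theta) partition of n items.

    Uses the standard sequential rule: with i items already placed in k clusters,
    the next item opens a new cluster with probability (theta + alpha*k)/(theta + i)
    and otherwise joins cluster j with probability proportional to (size_j - alpha).
    Assumes 0 <= alpha < 1 and theta > 0.
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        k = len(sizes)
        p_new = (theta + alpha * k) / (theta + i)
        if rng.random() < p_new:
            sizes.append(1)
        else:
            weights = [s - alpha for s in sizes]
            j = rng.choices(range(k), weights=weights)[0]
            sizes[j] += 1
    return sizes


n = 2000
alpha = 0.25   # hypothetical discount value
lam = 0.5      # hypothetical scaling constant

# Strength scaled linearly with the sample size (the microclustering regime).
micro = sample_ewens_pitman_sizes(n, alpha, theta=lam * n)
# Strength held fixed, for comparison.
fixed = sample_ewens_pitman_sizes(n, alpha, theta=1.0)

print("scaled theta: clusters =", len(micro), " largest =", max(micro))
print("fixed theta:  clusters =", len(fixed), " largest =", max(fixed))
```

In a run of this sketch, the scaled-strength partition should show many small clusters, whereas the fixed-strength partition should show few clusters with at least one large one. A single simulation at one sample size only hints at the asymptotic behavior the paper proves.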
Problem

Research questions and friction points this paper is trying to address.

Develops a microclustering random partition model for large-scale entity resolution
Ensures the largest cluster grows sub-linearly with the sample size while the number of clusters grows linearly
Makes Bayesian posterior computation fast through variational inference
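The linear growth in the number of clusters can be checked numerically in a simple special case. The sketch below assumes a discount of zero (the Ewens case), where the expected number of clusters is the standard sum of theta / (theta + i) over i = 0, ..., n - 1. With `theta = lam * n` this sum divided by n approaches `lam * log(1 + 1/lam)`, which follows from approximating the sum by an integral. The constant `lam` and the function name are hypothetical, and this is a textbook calculation, not a result quoted from the paper.

```python
import math


def expected_num_clusters_ewens(n, theta):
    """Expected number of clusters for the Ewens partition (discount 0).

    Item i + 1 opens a new cluster with probability theta / (theta + i),
    so the expectation is the sum of those probabilities.
    """
    return sum(theta / (theta + i) for i in range(n))


lam = 0.5  # hypothetical scaling constant
limit = lam * math.log(1.0 + 1.0 / lam)

for n in (100, 1000, 10000):
    scaled = expected_num_clusters_ewens(n, lam * n) / n   # theta grows with n
    fixed = expected_num_clusters_ewens(n, 1.0) / n        # theta held fixed
    print(n, round(scaled, 4), round(fixed, 4))

print("limit of scaled ratio:", round(limit, 4))
```

With a fixed strength the ratio of clusters to items shrinks toward zero, while with the linearly scaled strength it settles at a positive constant. That matches the linear-growth half of the microclustering property.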
Innovation

Methods, ideas, or system contributions that make the work stand out.

Microclustering Ewens--Pitman model for random partitions
Efficient variational inference for posterior computation
Three orders of magnitude speed-up over existing Bayesian methods for entity resolution