🤖 AI Summary
Large-scale entity resolution calls for partition priors with the microclustering property, under which the number of clusters grows linearly with the sample size while the size of the largest cluster grows sublinearly, yet existing Bayesian approaches with this property are computationally expensive at scale. This paper introduces the microclustering Ewens–Pitman model, obtained by scaling the strength (concentration) parameter of the Ewens–Pitman random partition linearly with the sample size, and proves that the resulting partition has the microclustering property. Exploiting the connection between the Ewens–Pitman random partition and the Pitman–Yor process, the authors develop efficient variational inference schemes for posterior computation in entity resolution. The resulting method maintains accuracy comparable to state-of-the-art Bayesian entity resolution while accelerating inference by roughly three orders of magnitude, making Bayesian entity resolution practical at much larger scales.
📝 Abstract
We introduce the microclustering Ewens--Pitman model for random partitions, obtained by scaling the strength parameter of the Ewens--Pitman model linearly with the sample size. The resulting random partition is shown to have the microclustering property, namely: the size of the largest cluster grows sub-linearly with the sample size, while the number of clusters grows linearly. By leveraging the interplay between the Ewens--Pitman random partition and the Pitman--Yor process, we develop efficient variational inference schemes for posterior computation in entity resolution. Our approach achieves a speed-up of three orders of magnitude over existing Bayesian methods for entity resolution, while maintaining competitive empirical performance.
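As a concrete illustration of the construction (not code from the paper), the sketch below samples a partition from the two-parameter Ewens--Pitman model via its standard sequential, Chinese-restaurant-style construction and sets the strength parameter to θ = λn for a hypothetical scaling rate λ, mirroring the linear scaling described in the abstract. The function name and parameter values are illustrative assumptions; empirically, the number of clusters grows roughly linearly in n while the largest cluster stays small, which is the microclustering behaviour the paper establishes.

```python
import numpy as np

def sample_ewens_pitman_partition(n, alpha, theta, rng):
    """Sample a partition of n items from the two-parameter Ewens--Pitman
    (Pitman--Yor) model via its sequential construction: after i items,
    the next item joins an existing cluster of size n_j with probability
    (n_j - alpha) / (i + theta), or starts a new cluster with probability
    (theta + k * alpha) / (i + theta), where k is the current number of clusters."""
    sizes = np.zeros(n)  # cluster sizes; at most n clusters
    k = 0
    for i in range(n):
        probs = np.empty(k + 1)
        probs[:k] = sizes[:k] - alpha       # weights of existing clusters
        probs[k] = theta + k * alpha        # weight of opening a new cluster
        j = rng.choice(k + 1, p=probs / (i + theta))
        if j == k:
            k += 1
        sizes[j] += 1
    return sizes[:k]

rng = np.random.default_rng(0)
alpha, lam = 0.5, 1.0  # hypothetical discount and linear scaling rate (illustrative values)
for n in (1_000, 4_000, 16_000):
    sizes = sample_ewens_pitman_partition(n, alpha=alpha, theta=lam * n, rng=rng)
    # clusters/n stays roughly constant, while the largest cluster grows slowly
    print(f"n={n:>6}  clusters/n={len(sizes) / n:.3f}  max cluster={int(sizes.max())}")
```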