Stop Relearning: Model Reuse via Feature Distribution Analysis for Incremental Entity Resolution

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in multi-source entity resolution (MS-ER)—including poor model reusability under heterogeneous data, high computational overhead in incremental training, blind cross-source model transfer, and low accuracy in threshold-based matching—this paper proposes a lightweight incremental resolution framework grounded in feature distribution similarity. We introduce a Wasserstein-distance-driven model selection mechanism, enabling interpretable and controllable cross-source transfer. Furthermore, we design a retraining-free model reuse paradigm that integrates active labeling with adaptability assessment to ensure incremental stability. Experimental results demonstrate that, at comparable matching quality, our method achieves 48× higher efficiency than state-of-the-art multi-source active learning approaches and 163× higher efficiency than conventional transfer learning methods, significantly reducing both annotation effort and training cost.
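The Wasserstein-distance-driven model selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`source_distance`, `select_model`), the dict-based repository, and the use of a mean over per-feature 1-D Wasserstein distances are all assumptions for demonstration.

```python
# Sketch: pick an existing model for a new source by comparing the
# empirical distributions of its similarity features against those of
# sources already in the model repository (hypothetical setup).
import numpy as np
from scipy.stats import wasserstein_distance

def source_distance(features_a, features_b):
    """Mean per-feature 1-D Wasserstein distance between two sources.

    features_a, features_b: dicts mapping feature name -> array of
    similarity values observed for that source's record pairs.
    """
    shared = sorted(set(features_a) & set(features_b))
    return np.mean([wasserstein_distance(features_a[f], features_b[f])
                    for f in shared])

def select_model(new_source, repository):
    """Return the repository source whose feature distributions are closest."""
    return min(repository,
               key=lambda name: source_distance(new_source, repository[name]))

rng = np.random.default_rng(0)
repo = {
    "source_A": {"name_sim": rng.normal(0.80, 0.05, 500),
                 "year_sim": rng.normal(0.60, 0.10, 500)},
    "source_B": {"name_sim": rng.normal(0.30, 0.05, 500),
                 "year_sim": rng.normal(0.20, 0.10, 500)},
}
new = {"name_sim": rng.normal(0.78, 0.05, 500),
       "year_sim": rng.normal(0.55, 0.10, 500)}
print(select_model(new, repo))  # closest existing source: source_A
```

Because the distance is computed on feature distributions rather than labels, the selection is interpretable (one can inspect which features diverge) and requires no retraining before reuse.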

📝 Abstract
Entity resolution is essential for data integration, facilitating analytics and insights from complex systems. Multi-source and incremental entity resolution address the challenges of integrating diverse and dynamic data, which is common in real-world scenarios. A critical question is how to classify matches and non-matches among record pairs from new and existing data sources. Traditional threshold-based methods often yield lower quality than machine learning (ML) approaches, while incremental methods may lack stability depending on the order in which new data is integrated. Additionally, reusing training data and existing models for new data sources remains an open problem in multi-source entity resolution. Even transfer learning does not address which source domain should supply the model and training data for a given target domain. Naively training a new model for each new linkage problem is inefficient. This work addresses these challenges, focusing on creating and managing models with a small labeling effort and on selecting suitable models for new data sources based on their feature distributions. Our method StoRe achieves matching quality comparable to the baselines while substantially outperforming both a multi-source active learning approach and a transfer learning approach in efficiency, running up to 48 times faster than the former and 163 times faster than the latter.
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-source entity resolution scalability and heterogeneity challenges
Enables model reuse across entity resolution tasks without retraining
Reduces labeling effort through feature distribution analysis and clustering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters similar ER tasks via feature distribution analysis
Initializes model repository with moderate labeling effort
Enables continuous integration of new data sources efficiently
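The first innovation bullet, clustering similar ER tasks by their feature distributions, can be sketched with standard hierarchical clustering over pairwise Wasserstein distances. The source names, the single toy feature, and the distance threshold `t=0.2` are illustrative assumptions; the paper's actual clustering procedure and parameters may differ.

```python
# Hypothetical sketch: group ER tasks (data sources) whose similarity-feature
# distributions are close, so each cluster can share one repository model.
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
# Toy sources: each is a 1-D sample of one similarity feature.
sources = {
    "s1": rng.normal(0.80, 0.05, 300),
    "s2": rng.normal(0.82, 0.05, 300),
    "s3": rng.normal(0.30, 0.05, 300),
    "s4": rng.normal(0.28, 0.05, 300),
}
names = sorted(sources)
n = len(names)

# Symmetric pairwise Wasserstein distance matrix between sources.
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = wasserstein_distance(sources[names[i]], sources[names[j]])
        dist[i, j] = dist[j, i] = d

# Average-linkage agglomerative clustering; sources whose distributions
# differ by less than the threshold end up in the same cluster.
labels = fcluster(linkage(squareform(dist), method="average"),
                  t=0.2, criterion="distance")
clusters = {name: int(lab) for name, lab in zip(names, labels)}
print(clusters)  # s1/s2 share a cluster, s3/s4 share another
```

A new data source is then assigned to its nearest cluster and served by that cluster's existing model, with retraining triggered only when no cluster is close enough, which is what keeps the incremental integration cheap.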