Heterogeneity in Entity Matching: A Survey and Experimental Analysis

📅 2025-08-11
🤖 AI Summary
This paper addresses heterogeneous entity matching (HEM), a core challenge in data integration that arises when sources differ in structure, format, schema, and semantics. It introduces a taxonomy, grounded in prior work, that separates representation heterogeneity from semantic heterogeneity and their subtypes, and connects it to the FAIR principles to show how they both reveal HEM challenges and suggest mitigation strategies. Through a critical review of recent EM methods and targeted experiments on state-of-the-art models, the authors find persistent limitations in robustness and adaptability under semantic heterogeneity. The analysis points to future research directions, including multimodal matching, human-in-the-loop workflows, deeper integration with large language models and knowledge graphs, and fairness-aware evaluation in heterogeneous settings.

📝 Abstract
Entity matching (EM) is a fundamental task in data integration and analytics, essential for identifying records that refer to the same real-world entity across diverse sources. In practice, datasets often differ widely in structure, format, schema, and semantics, creating substantial challenges for EM. We refer to this setting as Heterogeneous EM (HEM). This survey offers a unified perspective on HEM by introducing a taxonomy, grounded in prior work, that distinguishes two primary categories -- representation and semantic heterogeneity -- and their subtypes. The taxonomy provides a systematic lens for understanding how variations in data form and meaning shape the complexity of matching tasks. We then connect this framework to the FAIR principles -- Findability, Accessibility, Interoperability, and Reusability -- demonstrating how they both reveal the challenges of HEM and suggest strategies for mitigating them. Building on this foundation, we critically review recent EM methods, examining their ability to address different heterogeneity types, and conduct targeted experiments on state-of-the-art models to evaluate their robustness and adaptability under semantic heterogeneity. Our analysis uncovers persistent limitations in current approaches and points to promising directions for future research, including multimodal matching, human-in-the-loop workflows, deeper integration with large language models and knowledge graphs, and fairness-aware evaluation in heterogeneous settings.
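The distinction between the two categories can be illustrated with a small sketch. This is not the paper's method or taxonomy definition: the records, the `schema_map`, the helper names, and the choice of a synonym pair as an instance of semantic heterogeneity are all hypothetical, and the matcher is a deliberately naive character-level baseline built on Python's standard `difflib`. Under those assumptions, it shows that a difference in attribute names and casing can be resolved by normalization, while a difference in vocabulary for the same concept cannot.

```python
from difflib import SequenceMatcher


def normalize(record, schema_map):
    """Rename source-specific attribute names to shared ones and lowercase values.

    schema_map is a hypothetical, hand-written attribute alignment.
    """
    return {schema_map.get(k, k): str(v).strip().lower() for k, v in record.items()}


def similarity(a, b):
    """Mean character-level similarity over the attributes both records share."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    return sum(SequenceMatcher(None, a[k], b[k]).ratio() for k in shared) / len(shared)


# Representation heterogeneity (illustrative): same entity, but the two
# sources use different attribute names and different casing.
left = {"name": "Apple iPhone 13", "brand": "Apple"}
right = {"title": "APPLE IPHONE 13", "manufacturer": "apple"}
schema_map = {"title": "name", "manufacturer": "brand"}

# Without schema alignment the records share no attributes at all.
before = similarity(normalize(left, {}), normalize(right, {}))
# With alignment and case folding the values line up.
after = similarity(normalize(left, {}), normalize(right, schema_map))

# Semantic heterogeneity (illustrative assumption): the schemas already
# agree, but the sources use different words for the same concept.
semantic = similarity({"category": "sofa"}, {"category": "couch"})

print(f"representation, before normalization: {before:.2f}")
print(f"representation, after normalization:  {after:.2f}")
print(f"semantic (synonyms), string matcher:  {semantic:.2f}")
```

In this toy setting the representational gap closes once the schemas are aligned, whereas the synonym pair still scores low because string similarity carries no notion of meaning. That is one simple way to see why robustness under semantic heterogeneity is evaluated separately.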
Problem

Research questions and friction points this paper is trying to address.

Addressing heterogeneity challenges in Entity Matching (EM) across diverse datasets
Classifying representation and semantic heterogeneity types in EM tasks
Evaluating robustness of EM methods under semantic heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Taxonomy for representation and semantic heterogeneity
Connecting the FAIR principles to HEM challenges and mitigation strategies
Targeted experiments on the robustness of state-of-the-art EM models under semantic heterogeneity