🤖 AI Summary
Traditional relational databases rely on exact matching of column names and values, which makes them ill-suited to integrating heterogeneous databases where column names are inconsistent and data is fragmented across tables. To address this, we propose an end-to-end fuzzy join framework. It builds a weighted graph whose edge weights combine the semantic embedding similarity of column names with row-level fuzzy value overlap, the latter quantified as a negative-log-transformed Jaccard score. Multi-hop join paths are then discovered through graph traversal, enabling automated indirect and non-equi joins. Unlike conventional single-hop equi-joins, our approach supports semantics-aware linkage across heterogeneous schemas. In experiments on synthetic healthcare-style databases, the method recovers valid join relationships despite obfuscated column names and partial value mismatches, which improves joinability in heterogeneous integration settings.
📝 Abstract
Traditional relational databases require users to manually specify join keys and assume exact matches between column names and values. In practice, this limits joinability across fragmented or inconsistently named tables. We propose a fuzzy join framework that automatically identifies joinable column pairs and traverses indirect (multi-hop) join paths across multiple databases. Our method combines column name similarity with row-level fuzzy value overlap, computes edge weights using negative log-transformed Jaccard scores, and performs join path discovery via graph traversal. Experiments on synthetic healthcare-style databases demonstrate the system's ability to recover valid joins despite fuzzified column names and partial value mismatches. This research has direct applications in data integration.
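The pipeline described above (fuzzy value overlap, negative-log edge weights, graph traversal) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `difflib.SequenceMatcher` stands in for whatever fuzzy value matching the paper uses, column-name similarity is left out for brevity, tables are treated as graph nodes, Dijkstra's algorithm is used as the traversal, and the function names, thresholds, and sample tables are all invented for the example. One property that makes the negative log convenient is that summing the weights along a path equals the negative log of the product of the Jaccard scores, so the shortest path is the one with the highest multiplied overlap.

```python
import heapq
import math
from difflib import SequenceMatcher


def fuzzy_jaccard(values_a, values_b, threshold=0.85):
    """Jaccard-style overlap where values match if their string similarity
    reaches `threshold` (hypothetical default, not taken from the paper)."""
    a = {str(v).strip().lower() for v in values_a}
    b = {str(v).strip().lower() for v in values_b}
    if not a or not b:
        return 0.0

    def similar(x, y):
        return SequenceMatcher(None, x, y).ratio() >= threshold

    hits_a = sum(1 for x in a if any(similar(x, y) for y in b))
    hits_b = sum(1 for y in b if any(similar(x, y) for x in a))
    matched = min(hits_a, hits_b)  # keeps the score within [0, 1]
    return matched / (len(a) + len(b) - matched)


def build_table_graph(tables, min_overlap=0.1):
    """tables: {table: {column: [values]}}.
    Returns {table: {other: (weight, column, other_column)}}, keeping only
    the best-overlapping column pair for each pair of tables."""
    graph = {t: {} for t in tables}
    names = sorted(tables)
    for i, s in enumerate(names):
        for t in names[i + 1:]:
            best = None
            for cs, vs in tables[s].items():
                for ct, vt in tables[t].items():
                    score = fuzzy_jaccard(vs, vt)
                    if score >= min_overlap and (best is None or score > best[0]):
                        best = (score, cs, ct)
            if best is not None:
                weight = -math.log(best[0])  # high overlap -> low cost
                graph[s][t] = (weight, best[1], best[2])
                graph[t][s] = (weight, best[2], best[1])
    return graph


def best_join_path(graph, source, target):
    """Dijkstra over the table graph. Returns (list of joined column pairs,
    product of Jaccard scores along the path), or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, (w, cu, cv) in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = (u, cu, cv)
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path, node = [], target
    while node != source:
        u, cu, cv = prev[node]
        path.append((f"{u}.{cu}", f"{node}.{cv}"))
        node = u
    path.reverse()
    return path, math.exp(-dist[target])


# Invented sample data: keys differ in case, punctuation, and whitespace,
# and "patients" has no column that overlaps directly with "billing".
tables = {
    "patients": {"pid": ["P001", "P002", "P003"],
                 "name": ["Ann Lee", "Bob Ray", "Cy Dow"]},
    "visits": {"patient_ref": ["P-001", "p002", "P003 "],
               "visit_no": ["V10", "V11", "V12"]},
    "billing": {"visit": ["V10", "V11", "V13"],
                "amount": ["120", "80", "45"]},
}

graph = build_table_graph(tables)
path, strength = best_join_path(graph, "patients", "billing")
print(path)
# [('patients.pid', 'visits.patient_ref'), ('visits.visit_no', 'billing.visit')]
print(round(strength, 3))  # 0.5
```

In this toy setup the only route from `patients` to `billing` goes through `visits`, which is the kind of indirect (multi-hop) join the abstract refers to. A fuller version would also fold column-name similarity into the edge weight, as the abstract describes.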