JOINT: Join Optimization and Inference via Network Traversal

📅 2025-09-08
🤖 AI Summary
Traditional relational databases rely on exact column-name and value matching, which makes them ill-suited to integration across heterogeneous databases, where column names are inconsistent and data is fragmented. To address this, we propose an end-to-end fuzzy join framework. It builds a weighted graph whose edge weights integrate the semantic embedding similarity of column names with row-level fuzzy value overlap, the latter quantified as a negative-log-transformed Jaccard score. Multi-hop join paths are then discovered through graph traversal, enabling automated, indirect, and non-equi joins. Unlike conventional single-hop equi-joins, our approach supports semantics-aware linkage across heterogeneous schemas. Evaluated on synthetic healthcare databases, the method recovers valid join relationships despite column-name obfuscation and partial value mismatches, improving data connectability in heterogeneous environments.

📝 Abstract
Traditional relational databases require users to manually specify join keys and assume exact matches between column names and values. In practice, this limits joinability across fragmented or inconsistently named tables. We propose a fuzzy join framework that automatically identifies joinable column pairs and traverses indirect (multi-hop) join paths across multiple databases. Our method combines column name similarity with row-level fuzzy value overlap, computes edge weights using negative log-transformed Jaccard scores, and performs join path discovery via graph traversal. Experiments on synthetic healthcare-style databases demonstrate the system's ability to recover valid joins despite fuzzified column names and partial value mismatches. This research has direct applications in data integration.
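
The edge-weight idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and values are compared by exact set membership, whereas the paper describes fuzzy value overlap. One useful property of the negative log transform is that summing weights along a path equals the negative log of the product of the Jaccard scores, so a lower total weight corresponds to a higher combined overlap.

```python
import math


def jaccard(values_a, values_b):
    """Jaccard overlap of two columns' values (exact matching; a simplification)."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a and not set_b:
        return 0.0  # two empty columns give no evidence of joinability
    return len(set_a & set_b) / len(set_a | set_b)


def neg_log_jaccard_weight(values_a, values_b):
    """Hypothetical edge weight: -log(Jaccard). Lower weight means stronger overlap."""
    score = jaccard(values_a, values_b)
    if score == 0.0:
        return math.inf  # no shared values: treat as no usable edge
    return -math.log(score)


# Two columns sharing 2 of 4 distinct values have Jaccard 0.5.
weight = neg_log_jaccard_weight([1, 2, 3], [2, 3, 4])
```

Identical columns get weight 0, and disjoint columns get an infinite weight, which a path search would never select.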

Problem

Research questions and friction points this paper is trying to address.

Automating join key identification across fragmented databases
Handling fuzzy column name and value mismatches in joins
Discovering multi-hop join paths via graph traversal methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuzzy join framework for automatic column matching
Combines name similarity with value overlap scoring
Graph traversal for multi-hop join path discovery
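
Multi-hop join path discovery can be illustrated with a shortest-path search over weighted edges. The sketch below assumes Dijkstra's algorithm as the traversal, tables as graph nodes, and made-up table names and overlap scores; the source only states that paths are found via graph traversal, so all of these are illustrative choices.

```python
import heapq
import math


def cheapest_join_path(edges, source, target):
    """Dijkstra over an undirected graph given as (node_a, node_b, weight) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))

    best = {source: 0.0}   # lowest known cost to each node
    prev = {}              # predecessor on the cheapest known path
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            new_cost = cost + w
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))

    if target not in best:
        return None, math.inf  # no join path exists
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]


# Hypothetical tables with assumed overlap scores, converted to -log weights.
edges = [
    ("patients", "visits", -math.log(0.9)),
    ("visits", "billing", -math.log(0.8)),
    ("patients", "billing", -math.log(0.5)),
]
path, cost = cheapest_join_path(edges, "patients", "billing")
# path is ["patients", "visits", "billing"]: 0.9 * 0.8 = 0.72 beats the direct 0.5
```

In this example the indirect two-hop path wins because the product of its scores exceeds the score of the direct edge, which is the kind of indirect join the framework is meant to surface.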
Szu-Yun Ko
Department of Information Management, National Taiwan University, Taiwan
Bo-Cian Chang
Department of Economics, National Taiwan University, Taiwan
Alan Shu-Luen Chang
Department of Civil Engineering, National Taiwan University, Taiwan
Ethan Chen
University of Rochester
Computer Science