🤖 AI Summary
This work addresses the recovery of inter-column relationships in tables, which are usually implicit and often lost when tables are exported to standard formats. Relationships that merely hold on an observed table are frequently spurious, arising from coincidence, redundancy, or limited data diversity. To tackle this, the paper introduces functional relationships (FRs) as a unified notion that subsumes arithmetic relationships, string transformations, and functional dependencies. It characterizes FR reliability through four criteria (accuracy, atomicity, stability, and integrity) and proposes Auto-Relate, a mine-then-verify framework that first generates accurate candidate FRs and then checks the remaining three criteria with a Minimality Test, a Perturbation Test, and an Independence Test, respectively. Three optimizations improve efficiency: a group-by lower bound for early rejection, a closed-form speedup for arithmetic FRs, and a binomial bound for statistically guided early termination. On a benchmark built from 58,679 real-world spreadsheets and relational tables with 6,414 ground-truth FRs, Auto-Relate is compared against 18 baselines and achieves an average PR-AUC of 0.87, 59% higher than the best competing baseline.
📝 Abstract
Tables in spreadsheets, computational notebooks, and databases often contain rich inter-column relationships. Yet these relationships are typically implicit and are often lost when tables are exported to standard formats. Recovering them can benefit downstream tasks, including table understanding, data quality improvement, and provenance analysis. However, simply mining relationships that hold on an observed table is insufficient, as many are spurious due to coincidence, redundancy, or limited data diversity. In this paper, we introduce functional relationships (FRs) as a unified notion for inter-column relationships in tables, subsuming arithmetic relationships, string transformations, and functional dependencies. We characterize FR reliability through four complementary criteria: accuracy, atomicity, stability, and integrity. Guided by these criteria, we propose Auto-Relate, a mine-then-verify framework that first generates accurate candidate FRs and then verifies the remaining reliability criteria (atomicity, stability, and integrity) through a Minimality Test, a Perturbation Test, and an Independence Test, respectively. To further improve efficiency, we develop three optimization strategies: a group-by lower bound for early rejection, a closed-form speedup for arithmetic FRs, and a binomial bound for statistically guided early termination. We construct a large-scale benchmark suite from 58,679 real-world spreadsheets and relational tables, containing 6,414 ground-truth FRs spanning all three FR types. Extensive experiments against 18 baselines show that Auto-Relate consistently achieves the best performance, with an average PR-AUC of 0.87, 59% higher than the best competing baseline across all settings.
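To make the three FR types concrete, the sketch below scores one candidate of each kind on a toy table. It is an illustration under stated assumptions, not Auto-Relate itself: the abstract does not define accuracy formally, so the code assumes accuracy means the fraction of rows on which a candidate holds, and for functional dependencies it assumes a majority-value-per-group score computed with a group-by. The function names `fr_accuracy` and `fd_accuracy`, the column names, and the data are all hypothetical.

```python
from collections import Counter, defaultdict


def fr_accuracy(rows, inputs, output, fn):
    """Fraction of rows where fn(input columns) equals the output column.

    Assumed definition of accuracy; the paper may define it differently.
    """
    if not rows:
        return 1.0
    hits = sum(1 for row in rows if fn(*(row[c] for c in inputs)) == row[output])
    return hits / len(rows)


def fd_accuracy(rows, lhs, rhs):
    """Fraction of rows kept when each lhs group retains only its majority rhs value.

    Assumed scoring for a functional dependency lhs -> rhs, computed by grouping.
    """
    if not rows:
        return 1.0
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / len(rows)


# Hypothetical toy table; the last row breaks two of the three candidates.
rows = [
    {"price": 10, "qty": 2, "total": 20, "sku": "ab-1", "SKU": "AB-1", "city": "Oslo", "country": "NO"},
    {"price": 5, "qty": 3, "total": 15, "sku": "cd-2", "SKU": "CD-2", "city": "Oslo", "country": "NO"},
    {"price": 7, "qty": 1, "total": 7, "sku": "ef-3", "SKU": "EF-3", "city": "Bergen", "country": "NO"},
    {"price": 4, "qty": 4, "total": 17, "sku": "gh-4", "SKU": "GH-4", "city": "Bergen", "country": "SE"},
]

# Arithmetic relationship: total = price * qty
arithmetic_acc = fr_accuracy(rows, ["price", "qty"], "total", lambda p, q: p * q)
# String transformation: SKU = upper(sku)
string_acc = fr_accuracy(rows, ["sku"], "SKU", str.upper)
# Functional dependency: city -> country
fd_acc = fd_accuracy(rows, "city", "country")

print(arithmetic_acc, string_acc, fd_acc)
```

A score like this covers only the first criterion. A relationship with perfect accuracy on the observed rows can still be coincidental, which is why the framework described above adds separate tests for atomicity, stability, and integrity.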