Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the challenge of semantic modeling across heterogeneous tables whose column headers diverge. The authors propose NAVI, a framework that treats header-value pairs as fundamental units, introduces these segments as the core modeling construct, and explicitly distinguishes between stable attributes and instance-specific attributes. NAVI jointly exploits schema-level structural information and column-level distributional characteristics through masked segment modeling and entropy-driven segment alignment, aiming at both structural header-value coupling and semantic consistency. Experiments on heterogeneous yet domain-aligned tabular datasets show improved data reconstruction quality, semantic coherence, and downstream task performance.

📝 Abstract

Real-world domains often contain heterogeneous tables whose headers vary while their underlying attribute semantics are shared, making it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders model parts of this problem, but often underuse column-level value distributions and apply uniform objectives across attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings.
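To make the terms in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows one plausible reading of three ideas: a row split into header-value segments, random masking of segment values, and Shannon entropy of a column's empirical value distribution as a signal for separating stable from instance-specific attributes. The function names, the `[MASK]` token, the toy rows, and the idea of using an entropy cut-off are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random
from collections import Counter


def to_segments(row):
    """Turn one table row (a dict) into a list of (header, value) segments."""
    return [(header, str(value)) for header, value in row.items()]


def column_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_segments(segments, mask_prob, rng, mask_token="[MASK]"):
    """Hide each segment's value with probability mask_prob.

    Returns the masked segments and the list of hidden (header, value)
    pairs that a model would be asked to reconstruct.
    """
    masked, targets = [], []
    for header, value in segments:
        if rng.random() < mask_prob:
            masked.append((header, mask_token))
            targets.append((header, value))
        else:
            masked.append((header, value))
    return masked, targets


# Toy data (hypothetical): "brand" repeats across rows, "sku" is unique per row.
rows = [
    {"brand": "Acme", "sku": "A-001"},
    {"brand": "Acme", "sku": "A-002"},
    {"brand": "Acme", "sku": "A-003"},
    {"brand": "Bolt", "sku": "B-001"},
]

# Per-column entropy: low values suggest a stable attribute, high values an
# instance-specific one (an assumed heuristic, not taken from the paper).
entropy_by_header = {
    header: column_entropy([str(r[header]) for r in rows]) for header in rows[0]
}

rng = random.Random(0)
masked_row, hidden = mask_segments(to_segments(rows[0]), mask_prob=0.5, rng=rng)
```

In this toy table the `sku` column has higher entropy than `brand`, so an entropy cut-off would treat `sku` as instance-specific and `brand` as stable. How NAVI actually uses entropy during alignment is not stated in the abstract.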

Problem

Research questions and friction points this paper is trying to address.

heterogeneous tabular data

semantic alignment

structural induction

schema representation

column semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

segment-centric pretraining

masked segment modeling

entropy-driven alignment