Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation

📅 2026-06-01
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of semantic modeling across heterogeneous tables whose column headers diverge. The authors propose NAVI, a framework that treats “header-value” pairs (segments) as the core modeling unit and explicitly distinguishes stable attributes from instance-specific ones. NAVI jointly exploits schema-level structural information and column-level distributional characteristics through masked segment modeling and entropy-driven segment alignment, enforcing both header-value coupling and semantic consistency. Experiments on heterogeneous yet domain-aligned tabular datasets show improved reconstruction quality, semantic coherence, and downstream task performance.
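To make the segment idea concrete, here is a minimal sketch of how a table row might be turned into header-value segments and then masked. It is an illustration under stated assumptions, not NAVI's implementation: the names `row_to_segments`, `mask_segments`, and the `[MASK]` token are hypothetical, and masking only the value half of a segment is an assumption the summary does not confirm.

```python
import random
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (header, value); hypothetical representation

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def row_to_segments(row: Dict[str, object]) -> List[Segment]:
    """Turn one table row into a list of (header, value) segments."""
    return [(str(header), str(value)) for header, value in row.items()]


def mask_segments(
    segments: List[Segment], mask_prob: float, rng: random.Random
) -> Tuple[List[Segment], List[Tuple[int, str]]]:
    """Replace the value of randomly chosen segments with MASK.

    Assumption: only the value is hidden and the header stays visible.
    Returns the masked segments plus (position, original value) targets
    that a model would be trained to reconstruct.
    """
    masked: List[Segment] = []
    targets: List[Tuple[int, str]] = []
    for position, (header, value) in enumerate(segments):
        if rng.random() < mask_prob:
            masked.append((header, MASK))
            targets.append((position, value))
        else:
            masked.append((header, value))
    return masked, targets


# Two rows from different tables with divergent headers for the same attribute.
row_a = {"Patient Age": 42, "Dx": "flu"}
row_b = {"age_years": 37, "diagnosis": "cold"}
segments_a = row_to_segments(row_a)
masked_a, targets_a = mask_segments(segments_a, mask_prob=0.5, rng=random.Random(0))
```

Keeping the header attached to each value is what lets a model relate `Patient Age` and `age_years` through the values they co-occur with, rather than through the header strings alone.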
📝 Abstract
Real-world domains often contain heterogeneous tables whose headers vary while their underlying attribute semantics are shared, making it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders model parts of this problem, but often underuse column-level value distributions and apply uniform objectives across attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings overall.
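The abstract does not say how entropy drives the alignment, so the following is only a sketch of one plausible ingredient: measuring the Shannon entropy of each column's value distribution and using it to separate stable attributes from instance-specific ones. The function names, the normalization by `log2(n)`, the 0.5 threshold, and the rule "low entropy means stable" are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def column_entropy(values: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(values: Sequence[str]) -> float:
    """Entropy scaled to [0, 1] by the maximum possible for n values."""
    if len(set(values)) <= 1:
        return 0.0  # constant or empty column carries no uncertainty
    return column_entropy(values) / math.log2(len(values))


def split_columns(
    table: Dict[str, List[str]], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Partition headers by normalized entropy.

    Assumption (not from the paper): columns below the threshold are
    treated as stable, the rest as instance-specific.
    """
    stable: List[str] = []
    instance_specific: List[str] = []
    for header, values in table.items():
        if normalized_entropy(values) < threshold:
            stable.append(header)
        else:
            instance_specific.append(header)
    return stable, instance_specific


table = {
    "country": ["US", "US", "US", "US"],      # one repeated value
    "record_id": ["r1", "r2", "r3", "r4"],    # unique per row
}
stable, instance_specific = split_columns(table)
```

A split of this kind would let a training objective treat the two groups differently instead of applying one uniform objective to every attribute, which is the limitation the abstract attributes to existing encoders.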
Problem

Research questions and friction points this paper is trying to address.

heterogeneous tabular data
semantic alignment
structural induction
schema representation
column semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

segment-centric pretraining
masked segment modeling
entropy-driven alignment
heterogeneous tabular representation
semantic alignment