🤖 AI Summary
This work addresses the challenge of integrating and querying multi-source heterogeneous data whose schemas and formats are inconsistent. It proposes an "executable schema contract": a schema discovered automatically from the raw data and shared by knowledge graph construction and query-time retrieval. The schema is built from three components: a closed-world field catalog that restricts LLM-based schema discovery to fields attested in the data, deterministic structural analysis (identity and foreign key detection, source hierarchy identification), and a monotonic extension protocol. The schema then guides information extraction, deduplication, and cross-source linking, and at query time conditions a multi-tool agent that routes retrieval across structured lookup, graph traversal, and vector search. In controlled zero-shot comparisons on four question-answering benchmarks, the method improves over retrieval-only and decomposition-based baselines. Ablation studies indicate that schema-conditioned routing, structural analysis, and schema-guided construction each contribute to the gains.
📝 Abstract
Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time, the schema, optionally extended via a monotonic protocol, conditions a multi-tool agent that routes retrieval across structured lookup, graph traversal, and vector search and returns grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.
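The abstract does not specify how the catalog, the structural analysis, or the monotonic protocol are implemented. The following is a minimal illustrative sketch under simplifying assumptions, not the authors' method: sources are lists of flat dictionaries with hashable values, a field is an identity-key candidate when it is non-null and unique, a foreign key is detected by a plain value-inclusion test, and "monotonic" is read as "fields may be added but never removed". All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set

Rows = List[Dict[str, object]]


def field_catalog(sources: Dict[str, Rows]) -> Set[str]:
    """Closed-world catalog: every 'source.field' name attested in the raw data."""
    return {
        f"{name}.{field}"
        for name, rows in sources.items()
        for row in rows
        for field in row
    }


def validate_schema(proposed: Set[str], catalog: Set[str]) -> Set[str]:
    """Reject any proposed field (e.g. from an LLM) that the catalog does not attest."""
    unknown = proposed - catalog
    if unknown:
        raise ValueError(f"unattested fields: {sorted(unknown)}")
    return proposed


def candidate_keys(rows: Rows) -> List[str]:
    """Fields whose values are non-null and unique across all rows (assumes hashable values)."""
    if not rows:
        return []
    keys = []
    for field in rows[0]:
        values = [row.get(field) for row in rows]
        if None not in values and len(set(values)) == len(values):
            keys.append(field)
    return keys


def is_foreign_key(child: Rows, child_field: str, parent: Rows, parent_key: str) -> bool:
    """Inclusion test: every non-null child value appears among the parent key values."""
    parent_values = {row[parent_key] for row in parent}
    child_values = {
        row[child_field] for row in child if row.get(child_field) is not None
    }
    return bool(child_values) and child_values <= parent_values


def extend_schema(schema: Set[str], additions: Set[str], catalog: Set[str]) -> Set[str]:
    """Monotonic extension (assumed reading): add attested fields, never remove any."""
    return schema | validate_schema(additions, catalog)


# Hypothetical toy data, not taken from the paper.
sources = {
    "authors": [{"author_id": 1, "name": "Ada"}, {"author_id": 2, "name": "Lin"}],
    "papers": [
        {"paper_id": 10, "author_id": 1, "title": "Graphs"},
        {"paper_id": 11, "author_id": 1, "title": "Schemas"},
    ],
}
catalog = field_catalog(sources)
schema = validate_schema({"authors.author_id", "papers.paper_id"}, catalog)
extended = extend_schema(schema, {"papers.title"}, catalog)
```

In this toy data, `papers.author_id` passes the inclusion test against `authors.author_id`, while a proposed field such as `papers.venue` would be rejected because it never occurs in the sources. Note that uniqueness alone over-generates key candidates (`name` and `title` also qualify here), so a real system would need additional signals to choose among them.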