🤖 AI Summary
This work addresses the end-to-end automatic construction of relational databases from unstructured text. We propose the first neurosymbolic framework for this task, organized into four stages: schema recognition → constraint inference → table generation → data population. The framework pairs large language models (LLMs) with symbolic rule engines: LLMs perform semantic parsing via prompt engineering, while constraint solvers and iterative validation enforce logical consistency and structural correctness. On multi-domain benchmarks, our approach achieves a 27% absolute improvement in schema correctness over state-of-the-art methods and 89.4% accuracy in data population. These results demonstrate substantial gains in semantic fidelity and structural rigor, advancing the automation of relational schema and instance generation from natural-language text.
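The four stages can be sketched as a toy pipeline. The stage names come from the summary above, but every function body here is a hypothetical stand-in: regex extraction replaces LLM prompting, and a trivial first-column rule replaces the constraint solver, so this illustrates only the data flow, not the paper's actual techniques.

```python
# Toy sketch of a four-stage text-to-database pipeline (illustrative only).
import re
import sqlite3

def recognize_schema(text):
    """Stage 1: schema recognition. A real system would prompt an LLM;
    here a hard-coded pattern maps sentences to an employee table."""
    rows = re.findall(r"(\w+) works in (\w+)\.", text)
    return {"employee": ["name", "department"]}, rows

def infer_constraints(schema):
    """Stage 2: constraint inference. Stand-in rule: first column is the
    primary key; the paper would use a solver plus validation instead."""
    return {table: cols[0] for table, cols in schema.items()}

def generate_tables(schema, keys):
    """Stage 3: table generation. Emit CREATE TABLE statements."""
    stmts = []
    for table, cols in schema.items():
        defs = ", ".join(
            f"{c} TEXT PRIMARY KEY" if c == keys[table] else f"{c} TEXT"
            for c in cols)
        stmts.append(f"CREATE TABLE {table} ({defs})")
    return stmts

def populate(conn, rows):
    """Stage 4: data population. Insert the extracted tuples; the
    PRIMARY KEY constraint rejects inconsistent duplicates."""
    conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)

text = "Alice works in Sales. Bob works in Engineering."
schema, rows = recognize_schema(text)
keys = infer_constraints(schema)
conn = sqlite3.connect(":memory:")
for stmt in generate_tables(schema, keys):
    conn.execute(stmt)
populate(conn, rows)
print(conn.execute(
    "SELECT department FROM employee WHERE name = 'Bob'").fetchone()[0])
# -> Engineering
```

The stage boundaries matter more than the stand-in logic: each stage's output (schema, keys, DDL, rows) is a checkable artifact, which is what lets symbolic validation catch LLM errors between stages.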
📝 Abstract
Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.