Taxonomy Inference for Tabular Data Using Large Language Models

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the critical challenge of automatic semantic type (concept) identification and hierarchical taxonomy construction for tabular data. We propose a dual-path large language model (LLM) paradigm: EmTT fine-tunes encoder-based LLMs (e.g., BERT) via contrastive learning to derive column-level semantic embeddings and perform unsupervised clustering; GeTT leverages decoder-based LLMs (e.g., GPT-4) with iterative prompting to generate type names and hierarchical relationships, eliminating reliance on predefined schemas. To our knowledge, this is the first work to integrate encoder-decoder collaboration for taxonomy inference in tabular data, jointly optimizing semantic representation fidelity and structural generation flexibility. Evaluated on three real-world datasets across six metrics, our method consistently outperforms state-of-the-art baselines. The automatically induced taxonomies achieve high alignment with human annotations (average F1 = 0.89), significantly advancing automation in data management, ontology learning, and exploratory data analysis.

Technology Category

Application Category

📝 Abstract
Taxonomy inference for tabular data is a critical task of schema inference, aiming at discovering entity types (i.e., concepts) of the tables and building their hierarchy. It can play an important role in data management, data exploration, ontology learning, and many data-centric applications. Existing schema inference systems focus more on XML, JSON or RDF data, and often rely on lexical formats and structures of the data for calculating similarities, with limited exploitation of the semantics of the text across a table. Motivated by recent works on taxonomy completion and construction using Large Language Models (LLMs), this paper presents two LLM-based methods for taxonomy inference for tables: (i) EmTT which embeds columns by fine-tuning with contrastive learning encoder-alone LLMs like BERT and utilises clustering for hierarchy construction, and (ii) GeTT which generates table entity types and their hierarchy by iterative prompting using a decoder-alone LLM like GPT-4. Extensive evaluation on three real-world datasets with six metrics covering different aspects of the output taxonomies has demonstrated that EmTT and GeTT can both produce taxonomies with strong consistency relative to the Ground Truth.
Problem

Research questions and friction points this paper is trying to address.

Inferring entity types and hierarchies for tabular data
Overcoming limitations of lexical-based schema inference methods
Leveraging LLMs for taxonomy construction from tables
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuning BERT with contrastive learning for embeddings
Using GPT-4 for iterative prompting to generate taxonomies
Combining clustering and LLMs for hierarchy construction
🔎 Similar Papers
No similar papers found.