🤖 AI Summary
This work addresses column semantic understanding in data lake indexing, specifically evaluating the effectiveness and efficiency of large language models (LLMs) for Column Type Annotation (CTA). We propose a synergistic framework integrating term definition generation, error-driven self-refinement of definitions, self-correction, in-context learning, and supervised fine-tuning. Our empirical study yields three key findings: (1) self-refined definitions improve F1 by 3.9% on average over the original definitions; (2) definition-based prompting outperforms example-based in-context learning on two out of three datasets with OpenAI models; and (3) combining fine-tuned models with self-refined definitions achieves the best overall performance, yielding an F1 gain of at least 3% over zero-shot prompting of fine-tuned models. Further analysis reveals that self-refinement is more cost-effective when fewer tables need to be annotated, whereas supervised fine-tuning is more efficient at large scale. Collectively, this work establishes a reusable methodology and empirical benchmark for leveraging LLMs in structured data semantic understanding.
📝 Abstract
Understanding the semantics of columns in relational tables is an important pre-processing step for indexing data lakes in order to provide rich data search. An approach to establishing such understanding is column type annotation (CTA), where the goal is to annotate table columns with terms from a given vocabulary. This paper experimentally compares different knowledge generation and self-refinement strategies for LLM-based column type annotation. The strategies include using LLMs to generate term definitions, error-based refinement of term definitions, self-correction, and fine-tuning using examples and term definitions. We evaluate these strategies along two dimensions: effectiveness, measured as F1 score, and efficiency, measured in terms of token usage and cost. Our experiments show that the best-performing strategy depends on the model/dataset combination. We find that using training data to generate label definitions outperforms using the same data as demonstrations for in-context learning on two out of three datasets when using OpenAI models. The experiments further show that using LLMs to refine label definitions brings an average F1 increase of 3.9% in 10 out of 12 setups compared to the non-refined definitions. Combining fine-tuned models with self-refined term definitions results in the overall highest performance, outperforming zero-shot prompting of fine-tuned models by at least 3% in F1 score. The cost analysis shows that, while reaching similar F1 scores, self-refinement via prompting is more cost-efficient for use cases requiring smaller numbers of tables to be annotated, while fine-tuning is more efficient for large numbers of tables.
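The two core strategies described above, definition-based prompting and error-driven refinement of term definitions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy vocabulary, and the string-based refinement step (which stands in for asking the LLM to rewrite a definition based on its own misclassifications) are all assumptions.

```python
def build_cta_prompt(column_values, definitions):
    """Assemble a zero-shot CTA prompt that grounds the label space in
    (possibly self-refined) term definitions instead of demonstrations."""
    defs = "\n".join(f"- {term}: {text}" for term, text in definitions.items())
    values = ", ".join(column_values)
    return (
        "Annotate the table column with exactly one term from the vocabulary.\n"
        f"Vocabulary:\n{defs}\n"
        f"Column values: {values}\n"
        "Answer with the term only."
    )


def refine_definition(definition, misclassified_values):
    """Error-driven refinement step (illustrative stand-in): extend the
    definition with the values the model previously confused, so the next
    prompt disambiguates them."""
    hint = ", ".join(misclassified_values)
    return f"{definition} Note: this also covers values such as {hint}."


# Hypothetical two-term vocabulary for demonstration purposes.
definitions = {
    "country": "The name of a sovereign state.",
    "city": "The name of a populated urban settlement.",
}

prompt = build_cta_prompt(["Berlin", "Paris", "Tokyo"], definitions)

# After an evaluation round, fold observed errors back into the definition.
definitions["city"] = refine_definition(definitions["city"], ["NYC", "LA"])
```

The prompt string would then be sent to the LLM of choice; the refinement loop repeats until the definitions stop improving held-out F1.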