Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System

📅 2025-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the difficulty non-technical users face in discovering relevant tables across multi-source, heterogeneous data environments, this paper proposes Pneuma, a retrieval-augmented generation (RAG) system for natural language–driven table discovery. The approach rests on two key ideas: (1) an LLM-based table representation that jointly preserves schema-level and row-level content for comprehensive data understanding; and (2) a retrieval architecture that augments LLMs with traditional information retrieval, combining full-text and vector search to harness the strengths of both. Experiments on benchmarks generated from six real-world datasets show that Pneuma outperforms widely used table search systems, including full-text search and state-of-the-art RAG baselines, in both retrieval accuracy and resource efficiency.

📝 Abstract
Finding relevant tables among databases, lakes, and repositories is the first step in extracting value from data. Such a task remains difficult because assessing whether a table is relevant to a problem does not always depend only on its content but also on the context, which is usually tribal knowledge known to the individual or team. While tools like data catalogs and academic data discovery systems target this problem, they rely on keyword search or more complex interfaces, limiting non-technical users' ability to find relevant data. The advent of large language models (LLMs) offers a unique opportunity for users to ask questions directly in natural language, making dataset discovery more intuitive, accessible, and efficient. In this paper, we introduce Pneuma, a retrieval-augmented generation (RAG) system designed to efficiently and effectively discover tabular data. Pneuma leverages large language models (LLMs) for both table representation and table retrieval. For table representation, Pneuma preserves schema and row-level information to ensure comprehensive data understanding. For table retrieval, Pneuma augments LLMs with traditional information retrieval techniques, such as full-text and vector search, harnessing the strengths of both to improve retrieval performance. To evaluate Pneuma, we generate comprehensive benchmarks that simulate table discovery workloads on six real-world datasets including enterprise data, scientific databases, warehousing data, and open data. Our results demonstrate that Pneuma outperforms widely used table search systems (such as full-text search and state-of-the-art RAG systems) in accuracy and resource efficiency.
Problem

Research questions and friction points this paper is trying to address.

Finding relevant tables in databases using natural language queries
Improving table retrieval with LLMs and traditional search techniques
Enhancing data discovery accuracy and efficiency for non-technical users
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs for table representation and retrieval
Combines full-text and vector search techniques
Preserves schema and row-level information
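The hybrid retrieval idea above, combining full-text search with vector search over serialized table representations, can be illustrated with a toy sketch. This is not Pneuma's implementation: the `TABLES` corpus, the score-blending weight `alpha`, and the bag-of-words cosine stand-in for dense embeddings are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: table descriptions serializing schema and
# sample rows as text, standing in for LLM-based table representations.
TABLES = {
    "sales": "table sales columns region quarter revenue rows north q1 100",
    "staff": "table staff columns name role salary rows alice engineer 90",
    "sensors": "table sensors columns site reading unit rows lab1 3.2 celsius",
}

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Plain BM25 over whitespace tokens (the full-text search path)."""
    toks = {d: t.split() for d, t in docs.items()}
    avgdl = sum(len(t) for t in toks.values()) / len(toks)
    n = len(docs)
    scores = {}
    for d, t in toks.items():
        tf = Counter(t)
        s = 0.0
        for w in query.split():
            df = sum(1 for t2 in toks.values() if w in t2)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores[d] = s
    return scores

def cosine_scores(query, docs):
    """Bag-of-words cosine similarity, a stand-in for dense vector search."""
    qv = Counter(query.split())
    scores = {}
    for d, t in docs.items():
        dv = Counter(t.split())
        dot = sum(qv[w] * dv[w] for w in qv)
        norm = (math.sqrt(sum(v * v for v in qv.values()))
                * math.sqrt(sum(v * v for v in dv.values())))
        scores[d] = dot / norm if norm else 0.0
    return scores

def hybrid_rank(query, docs, alpha=0.5):
    """Blend both paths with a min-max-normalized weighted sum."""
    def norm(s):
        lo, hi = min(s.values()), max(s.values())
        return {d: (v - lo) / (hi - lo) if hi > lo else 0.0 for d, v in s.items()}
    bm, cs = norm(bm25_scores(query, docs)), norm(cosine_scores(query, docs))
    return sorted(docs, key=lambda d: alpha * bm[d] + (1 - alpha) * cs[d],
                  reverse=True)

print(hybrid_rank("revenue region", TABLES))  # → ['sales', 'staff', 'sensors']
```

A natural-language query about revenue by region ranks the `sales` table first because both retrieval paths agree; when the paths disagree, `alpha` controls the trade-off between lexical and semantic evidence.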
Muhammad Imam Luthfi Balaka
University of Indonesia, Indonesia
David Alexander
University of Indonesia, Indonesia
Qiming Wang
The University of Chicago, USA
Yue Gong
The University of Chicago, USA
Adila Alfa Krisnadhi
University of Indonesia, Indonesia
Raul Castro Fernandez
The University of Chicago, USA