The Pneuma Project: Reifying Information Needs as Relational Schemas to Automate Discovery, Guide Preparation, and Align Data with Intent

📅 2026-01-07
🏛️ Conference on Innovative Data Systems Research
🤖 AI Summary
This work proposes Pneuma-Seeker, a system that targets the bottlenecks in data discovery and preparation that arise when user intent is ambiguous, evolving, or difficult to operationalize. Pneuma-Seeker models the user's information need as a relational schema and integrates retrieval-augmented generation (RAG), agentic frameworks, and structured data preparation techniques. Through context specialization, a conductor-style planner, and a shared-state convergence mechanism, the system supports intent-driven, semi-automatic data workflows. Through iterative interaction with a language model, it converges incrementally on a usable document aligned with the user's intent, and it builds up a documentation layer that captures institutional knowledge. In an evaluation based on LLM-simulated users, the system helps surface latent intent, guide discovery, and produce fit-for-purpose documents.

📝 Abstract
Data discovery and preparation remain persistent bottlenecks in the data management lifecycle, especially when user intent is vague, evolving, or difficult to operationalize. The Pneuma Project introduces Pneuma-Seeker, a system that helps users articulate and fulfill information needs through iterative interaction with a language model-powered platform. The system reifies the user's evolving information need as a relational data model and incrementally converges toward a usable document aligned with that intent. To achieve this, the system combines three architectural ideas: context specialization to reduce LLM burden across subtasks, a conductor-style planner to assemble dynamic execution plans, and a convergence mechanism based on shared state. The system integrates recent advances in retrieval-augmented generation (RAG), agentic frameworks, and structured data preparation to support semi-automatic, language-guided workflows. We evaluate the system through LLM-based user simulations and show that it helps surface latent intent, guide discovery, and produce fit-for-purpose documents. It also acts as an emergent documentation layer, capturing institutional knowledge and supporting organizational memory.
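The three architectural ideas named in the abstract can be illustrated with a toy loop. The sketch below is not Pneuma-Seeker's implementation: every name in it (`SharedState`, `Column`, `CATALOG`, `propose_schema`, `discover_sources`, `conductor`, `to_ddl`) is hypothetical, and the two component functions are hard-coded stand-ins for what would be LLM and retrieval calls. It only shows the general shape of the idea: the information need is held as a target relational schema in shared state, a conductor picks the next specialized component from that state, and the loop stops once every target column has a source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    name: str
    dtype: str
    source: Optional[str] = None  # "table.column" that fills it, once found


@dataclass
class SharedState:
    """State that every component reads and writes (hypothetical structure)."""
    question: str
    target: List[Column] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def converged(self) -> bool:
        # Converged once a non-empty target schema has every column sourced.
        return bool(self.target) and all(c.source for c in self.target)


# Toy catalog standing in for a retrieval index over an organization's tables.
CATALOG = {"region": "sales.region", "revenue": "sales.amount"}


def propose_schema(state: SharedState) -> None:
    # Stand-in for an LLM step that turns the question into a target schema.
    state.target = [Column("region", "text"), Column("revenue", "numeric")]
    state.log.append("propose_schema")


def discover_sources(state: SharedState) -> None:
    # Stand-in for retrieval: map each unsourced column to a catalog entry.
    for col in state.target:
        if col.source is None:
            col.source = CATALOG.get(col.name)
    state.log.append("discover_sources")


def conductor(state: SharedState, max_steps: int = 5) -> SharedState:
    """Choose the next component from the current state until convergence."""
    for _ in range(max_steps):
        if state.converged():
            break
        step = propose_schema if not state.target else discover_sources
        step(state)
    return state


def to_ddl(table: str, columns: List[Column]) -> str:
    # Render the reified information need as a relational table definition.
    cols = ", ".join(f"{c.name} {c.dtype}" for c in columns)
    return f"CREATE TABLE {table} ({cols});"
```

Calling `conductor(SharedState("revenue by region"))` runs the schema proposal once and the source discovery once, then stops. Because each component only sees the shared state, it can be given a narrow context, which is the intuition behind context specialization in this sketch.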
Problem

Research questions and friction points this paper is trying to address.

data discovery
data preparation
user intent
information needs
relational schemas
Innovation

Methods, ideas, or system contributions that make the work stand out.

relational schema reification
conductor-style planner
context specialization
intent-driven data preparation
shared-state convergence
Muhammad Imam Luthfi Balaka
The University of Chicago
Raul Castro Fernandez
The University of Chicago
DataSystems