🤖 AI Summary
This work proposes Pneuma-Seeker, a system that addresses bottlenecks in data discovery and preparation that arise when user intent is ambiguous, evolving, or difficult to operationalize. Pneuma-Seeker models the user's information need as a relational data schema and integrates retrieval-augmented generation (RAG), agentic frameworks, and structured data preparation. Three mechanisms (context specialization, a conductor-style planner, and convergence over shared state) enable intent-driven, semi-automatic data workflows. Through iterative interaction with a large language model, the system converges on a usable document aligned with user intent and incrementally builds a reusable organizational knowledge layer. In evaluations based on LLM-simulated users, it helps surface latent intent, guide discovery, and produce fit-for-purpose documents.
📝 Abstract
Data discovery and preparation remain persistent bottlenecks in the data management lifecycle, especially when user intent is vague, evolving, or difficult to operationalize. The Pneuma Project introduces Pneuma-Seeker, a system that helps users articulate and fulfill information needs through iterative interaction with a language model-powered platform. The system reifies the user's evolving information need as a relational data model and incrementally converges toward a usable document aligned with that intent. To achieve this, the system combines three architectural ideas: context specialization to reduce LLM burden across subtasks, a conductor-style planner to assemble dynamic execution plans, and a convergence mechanism based on shared state. The system integrates recent advances in retrieval-augmented generation (RAG), agentic frameworks, and structured data preparation to support semi-automatic, language-guided workflows. We evaluate the system through LLM-based user simulations and show that it helps surface latent intent, guide discovery, and produce fit-for-purpose documents. It also acts as an emergent documentation layer, capturing institutional knowledge and supporting organizational memory.
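The three architectural ideas named in the abstract can be illustrated with a toy sketch. This is not the Pneuma-Seeker implementation: every name below (`SharedState`, `model_schema`, `retrieve`, `next_step`, `run`, and the `CATALOG` data) is hypothetical, and the LLM calls a real system would make are replaced by hard-coded stand-ins. It only shows the general shape of small specialized components, a conductor that picks the next step from the current state, and a loop that stops once the shared state satisfies the target schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SharedState:
    """Shared state read and written by every component (hypothetical layout)."""
    question: str
    target_schema: Dict[str, str] = field(default_factory=dict)  # column -> type
    found_columns: Dict[str, str] = field(default_factory=dict)  # column -> source table
    log: List[str] = field(default_factory=list)


# Toy catalog standing in for a real retrieval index (invented data).
CATALOG = {
    "orders": ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "region"],
}


def model_schema(state: SharedState) -> None:
    """Specialized component: express the information need as a relational schema."""
    # Assumption: a real system would ask an LLM here; the schema is hard-coded.
    state.target_schema = {"region": "text", "total": "numeric"}
    state.log.append("model_schema")


def retrieve(state: SharedState) -> None:
    """Specialized component: find a source table for each missing target column."""
    for column in state.target_schema:
        for table, columns in CATALOG.items():
            if column in columns and column not in state.found_columns:
                state.found_columns[column] = table
    state.log.append("retrieve")


def next_step(state: SharedState) -> Optional[Callable[[SharedState], None]]:
    """Conductor: inspect the shared state and pick the next component, or stop."""
    if not state.target_schema:
        return model_schema
    if set(state.target_schema) - set(state.found_columns):
        return retrieve
    return None  # every target column has a source: converged


def run(question: str, max_steps: int = 10) -> SharedState:
    """Drive components until the conductor reports convergence or steps run out."""
    state = SharedState(question=question)
    for _ in range(max_steps):
        step = next_step(state)
        if step is None:
            break
        step(state)
    return state


state = run("What are total sales by region?")
print(state.log)
print(state.found_columns)
```

Each component sees only the part of the state it needs, which is one way to read "context specialization", and the `max_steps` bound keeps the loop from running forever when a column cannot be found in the catalog.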