🤖 AI Summary
To address the challenges of data discovery and cross-source analytics in multimodal data lakes, which hold structured, semi-structured, and unstructured data, this paper envisions an end-to-end query system with a declarative interface. Methodologically, it introduces a unified multimodal operator interface that extends classical relational operations with AI-composed ones; formally defines the essential steps of the system (data discovery, query planning, query processing, and result aggregation); and discusses how large language models (LLMs) could help realize and optimize these steps, for example through schema inference, semantic matching, and cross-modal query rewriting. Contributions include: (1) an operator interface for expressing joint exploration and analysis across heterogeneous data sources; (2) a formal definition of the end-to-end pipeline, together with the research challenges and opportunities that LLMs open up; and (3) preliminary attempts at the problem and a plan for future work.
📝 Abstract
Querying and exploring massive collections of data sources, such as data lakes, has been an essential research topic in the database community. Although much effort has been devoted to data discovery and data integration in data lakes, prior work has mainly focused on the scenario where the data lake consists of structured tables. However, real-world enterprise data lakes are often more complicated: they may contain silos of multi-modal data sources with structured, semi-structured, and unstructured data. In this paper, we envision an end-to-end system with a declarative interface for querying and analyzing multi-modal data lakes. First, we propose a set of multi-modal operators, a unified interface that extends relational operations with AI-composed ones to express analytical workloads over data sources in various modalities. In addition, we formally define the essential steps in the system, such as data discovery, query planning, query processing, and results aggregation. Building on these definitions, we then pinpoint the research challenges and discuss potential opportunities in realizing and optimizing these steps with advanced techniques brought by Large Language Models. Finally, we demonstrate our preliminary attempts to address this problem and outline future plans for this research topic.
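To make the four steps named in the abstract concrete, the following is a minimal Python sketch of how discovery, planning, processing, and aggregation might fit together, with one relational operator and one AI-style operator sharing the same interface. It is an illustration under stated assumptions, not the authors' design: every name here (`Source`, `discover`, `plan`, `semantic_filter`, and so on) is hypothetical, the planner is hard-coded, and `stub_semantic_match` is a plain substring test standing in for whatever LLM-based judgment a real system would use.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# All names below are hypothetical and exist only for illustration.
Row = Dict[str, Any]
Operator = Callable[[List[Row]], List[Row]]


@dataclass
class Source:
    name: str
    modality: str      # "structured", "semi-structured", or "unstructured"
    description: str   # free-text metadata used for discovery
    rows: List[Row]


def stub_semantic_match(text: str, concept: str) -> bool:
    """Stand-in for an LLM judgment: a case-insensitive substring test."""
    return concept.lower() in text.lower()


def discover(lake: List[Source], topic: str) -> List[Source]:
    """Data discovery: keep sources whose description matches the topic."""
    return [s for s in lake if stub_semantic_match(s.description, topic)]


def relational_filter(field: str, predicate: Callable[[Any], bool]) -> Operator:
    """Classical relational selection on a named field."""
    def op(rows: List[Row]) -> List[Row]:
        return [r for r in rows if field in r and predicate(r[field])]
    return op


def semantic_filter(field: str, concept: str) -> Operator:
    """AI-style selection: keep rows whose text matches a concept."""
    def op(rows: List[Row]) -> List[Row]:
        return [r for r in rows
                if stub_semantic_match(str(r.get(field, "")), concept)]
    return op


def plan(source: Source) -> List[Operator]:
    """Query planning (hard-coded here): pick operators by modality."""
    if source.modality == "structured":
        return [relational_filter("rating", lambda v: v <= 2)]
    return [semantic_filter("text", "refund")]


def execute(source: Source, operators: List[Operator]) -> List[Row]:
    """Query processing: apply the planned operators in order."""
    rows = source.rows
    for op in operators:
        rows = op(rows)
    return rows


def aggregate(results: Dict[str, List[Row]]) -> Dict[str, int]:
    """Results aggregation: count the matching rows per source."""
    return {name: len(rows) for name, rows in results.items()}


def run_query(lake: List[Source], topic: str) -> Dict[str, int]:
    results = {s.name: execute(s, plan(s)) for s in discover(lake, topic)}
    return aggregate(results)


lake = [
    Source("reviews_table", "structured", "customer feedback ratings",
           [{"id": 1, "rating": 1}, {"id": 2, "rating": 5}]),
    Source("support_emails", "unstructured", "customer feedback emails",
           [{"text": "I want a refund for my order."},
            {"text": "Great service!"}]),
    Source("hr_records", "structured", "employee payroll",
           [{"id": 9, "rating": 1}]),
]

summary = run_query(lake, "customer feedback")
print(summary)
```

In this toy lake, discovery drops `hr_records`, the structured source is handled by a relational selection, and the unstructured source by the stubbed semantic selection. The point of the sketch is only that both kinds of operator can be composed behind one interface; how a real system would plan, optimize, or call an LLM is left open.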