Data Analysis Prediction over Multiple Unseen Datasets: A Vector Embedding Approach

πŸ“… 2025-02-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the challenge of predicting analytical operator outcomes on unseen, massive heterogeneous datasets, this paper introduces NumTabData2Vecβ€”the first end-to-end dataset-level vectorization model. It maps raw tabular data to low-dimensional semantic embeddings, enabling cross-dataset semantic similarity search and analytical result inference. By jointly modeling data structure, statistical features, and operator semantics, NumTabData2Vec achieves high-accuracy outcome prediction on previously unseen, real-world multi-source datasets. Evaluated on multiple real-world benchmarks, it significantly outperforms state-of-the-art methods in prediction accuracy, reduces execution latency by over an order of magnitude, and effectively discriminates among diverse practical scenarios. This work establishes a scalable meta-learning paradigm for large-scale data analysis, offering robust generalization across heterogeneous datasets without requiring task-specific fine-tuning.

Technology Category

Application Category

πŸ“ Abstract
The massive increase in the data volume and dataset availability for analysts compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting the highest quality data for analysis highly increases task accuracy and efficiency, it is still a hard task, especially when the number of available inputs is very large. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from datasets similar to the queried one. Dataset similarity is performed via projecting each dataset to a vector embedding representation. The vectorization process is performed using our proposed deep learning model NumTabData2Vec, which takes a whole dataset and projects it into a lower vector embedding representation space. Through experimental evaluation, we compare the prediction performance and the execution time of our framework to another state-of-the-art modelling operator framework, illustrating that our approach predicts analytics outcomes accurately. Furthermore, our vectorization model can project different real-world scenarios to a lower vector embedding representation and distinguish between them.
Problem

Research questions and friction points this paper is trying to address.

Predict analytics outcomes across datasets
Enhance dataset selection efficiency
Vector embedding for dataset similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector embedding for data similarity
Deep learning model NumTabData2Vec
Accurate prediction of analytics outcomes
πŸ”Ž Similar Papers
No similar papers found.