π€ AI Summary
To address the challenge of predicting analytical operator outcomes on unseen, massive heterogeneous datasets, this paper introduces NumTabData2Vecβthe first end-to-end dataset-level vectorization model. It maps raw tabular data to low-dimensional semantic embeddings, enabling cross-dataset semantic similarity search and analytical result inference. By jointly modeling data structure, statistical features, and operator semantics, NumTabData2Vec achieves high-accuracy outcome prediction on previously unseen, real-world multi-source datasets. Evaluated on multiple real-world benchmarks, it significantly outperforms state-of-the-art methods in prediction accuracy, reduces execution latency by over an order of magnitude, and effectively discriminates among diverse practical scenarios. This work establishes a scalable meta-learning paradigm for large-scale data analysis, offering robust generalization across heterogeneous datasets without requiring task-specific fine-tuning.
π Abstract
The massive increase in the data volume and dataset availability for analysts compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting the highest quality data for analysis highly increases task accuracy and efficiency, it is still a hard task, especially when the number of available inputs is very large. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from datasets similar to the queried one. Dataset similarity is performed via projecting each dataset to a vector embedding representation. The vectorization process is performed using our proposed deep learning model NumTabData2Vec, which takes a whole dataset and projects it into a lower vector embedding representation space. Through experimental evaluation, we compare the prediction performance and the execution time of our framework to another state-of-the-art modelling operator framework, illustrating that our approach predicts analytics outcomes accurately. Furthermore, our vectorization model can project different real-world scenarios to a lower vector embedding representation and distinguish between them.