Zero-Execution Retrieval-Augmented Configuration Tuning of Spark Applications

📅 2025-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Ad hoc Spark queries are hard to tune automatically before their first execution, because no runtime information is available at that point. Method: This paper proposes a retrieval-based, zero-execution tuning method that recommends configuration parameters without executing the query, combining workload feature extraction, configuration similarity modeling, and retrieval-augmented matching. The authors also release what they describe as the largest and most comprehensive suite of Spark query datasets, optimal configurations, and runtime information. Contribution/Results: The method achieves 93.3% of the runtime improvement of state-of-the-art one-execution tuning while avoiding the slow initial run with default settings, which yields a lower cumulative runtime over the first 140 runs. The benefit is largest for ad hoc and analytical queries that are executed only once.

📝 Abstract

Large-scale data processing is increasingly done using distributed computing frameworks like Apache Spark, which have a considerable number of configurable parameters that affect runtime performance. For optimal performance, these parameters must be tuned to the specific job being run. Tuning commonly requires multiple executions to collect runtime information for updating parameters. This is infeasible for ad hoc queries that are run once or infrequently. Zero-execution tuning, where parameters are automatically set before a job's first run, can provide significant savings for all types of applications, but is more challenging since runtime information is not available. In this work, we propose a novel method for zero-execution tuning of Spark configurations based on retrieval. Our method achieves 93.3% of the runtime improvement of state-of-the-art one-execution optimization, entirely avoiding the slow initial execution using default settings. The shift to zero-execution tuning results in a lower cumulative runtime over the first 140 runs, and provides the largest benefit for ad hoc and analytical queries which only need to be executed once. We release the largest and most comprehensive suite of Spark query datasets, optimal configurations, and runtime information, which will promote future development of zero-execution tuning methods.

Problem

Research questions and friction points this paper is trying to address.

Automates Spark configuration tuning without initial execution.

Improves runtime performance for ad hoc and analytical queries.

Provides a dataset for future zero-execution tuning research.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-execution tuning for Spark applications

Retrieval-based configuration optimization method

Avoids initial execution with default settings
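
To make the retrieval idea above concrete, here is a minimal sketch of how a configuration could be looked up for an unseen query before it runs. It is not the paper's implementation: the feature vectors, the `TunedQuery` record, the `recommend_config` helper, and the use of cosine similarity with a single nearest neighbor are all assumptions made for illustration. Only the Spark property names (`spark.sql.shuffle.partitions`, `spark.executor.memory`) are real settings, and their values here are made up.

```python
import math
from dataclasses import dataclass


@dataclass
class TunedQuery:
    """A previously tuned query: its feature vector and its best-known config.

    Hypothetical record type; the paper's actual features are not shown here.
    """
    features: list[float]
    config: dict[str, str]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_config(new_features: list[float],
                     history: list[TunedQuery]) -> dict[str, str]:
    """Return the config of the most similar past query, without running anything."""
    best = max(history, key=lambda q: cosine_similarity(new_features, q.features))
    return best.config


# Illustrative history with made-up features and config values.
history = [
    TunedQuery([0.9, 0.1, 0.0],
               {"spark.sql.shuffle.partitions": "400", "spark.executor.memory": "8g"}),
    TunedQuery([0.1, 0.8, 0.3],
               {"spark.sql.shuffle.partitions": "64", "spark.executor.memory": "4g"}),
]

# A new, never-executed query whose features resemble the first entry.
recommended = recommend_config([0.8, 0.2, 0.1], history)
print(recommended)
```

The returned dictionary could then be applied when the job is submitted, so the first run already uses a tuned configuration instead of the defaults.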

🔎 Similar Papers

E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model