Predicting Lakehouse Performance in Clouds: An Empirical Exploration of Query Runtime Variance

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the substantial runtime variability of analytical queries in lakehouse systems, which undermines the accuracy of Query Performance Prediction (QPP) and the workload management techniques that depend on it, such as low-carbon scheduling. It systematically quantifies run-to-run variance using Kubernetes deployments across three public clouds and one private cloud, and finds that repeated executions of the same query can differ in runtime by nearly twofold. A factor analysis assesses key sources of this variance, such as data locality, co-tenant load, and caching effects. Addressing these sources can reduce the prediction error of state-of-the-art QPP models by up to 80%, and accounting for runtime variance in low-carbon scheduling can significantly reduce carbon costs.
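To make the "nearly twofold" figure concrete, here is a minimal sketch of two common ways to summarise run-to-run spread for one query: the max/min runtime ratio and the coefficient of variation. The function name `runtime_spread` and the sample runtimes are hypothetical, and the paper may use different variance metrics.

```python
from statistics import mean, stdev


def runtime_spread(runtimes_s):
    """Return (max/min ratio, coefficient of variation) for repeated runs of one query."""
    if len(runtimes_s) < 2:
        raise ValueError("need at least two runs to measure spread")
    if min(runtimes_s) <= 0:
        raise ValueError("runtimes must be positive")
    ratio = max(runtimes_s) / min(runtimes_s)   # 2.0 would mean a twofold difference
    cv = stdev(runtimes_s) / mean(runtimes_s)   # sample std dev relative to the mean
    return ratio, cv


# Made-up runtimes (seconds) for four executions of the same query.
runs = [10.0, 12.0, 15.0, 19.0]
ratio, cv = runtime_spread(runs)
print(f"max/min ratio = {ratio:.2f}, CV = {cv:.2f}")
```

The ratio captures the worst-case gap a scheduler might face, while the coefficient of variation describes typical dispersion; reporting both avoids a single outlier dominating the picture.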

📝 Abstract

Data analytics increasingly runs on distributed lakehouse systems, where platform operators must optimise monetary, resource, and environmental costs. Query Performance Prediction (QPP) helps to balance these costs and supports workload management techniques, such as adaptive resource scaling and low-carbon scheduling. However, runtimes in lakehouses can vary substantially, and the impact of runtime variance on QPP accuracy and workload orchestration has not previously been systematically studied for lakehouse systems. This paper addresses this gap by investigating the runtime variance observed for distributed lakehouse analytical queries and its impact on QPP. First, we quantify the run-to-run variance using Kubernetes deployments across three public clouds and one private cloud, spanning multiple database scales and three analytical benchmarks. Our results demonstrate that repeated executions of the same query can vary in runtime by nearly twofold. Second, we conduct a factor analysis study assessing key sources of this runtime variance, such as data locality, co-tenant load, and caching effects. Third, we examine how variance influences state-of-the-art QPP models, revealing that addressing key sources of variance can reduce prediction error by up to 80%. Finally, we demonstrate the downstream implications for low-carbon scheduling as an example of a workload management technique that relies on performance prediction, showing that accounting for runtime variance can lead to a significant reduction in carbon costs.
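The link between prediction error and carbon cost can be illustrated with a toy example. The sketch below is not the paper's scheduler: it assumes a hypothetical hourly carbon-intensity forecast, a job with constant power draw (so emissions are proportional to the summed intensity over its run window), and whole-hour runtimes. It shows how under-predicting the runtime can steer a job into a start slot that looks clean for the predicted duration but is worse for the actual one.

```python
def window_carbon(intensity, start, duration):
    """Sum of carbon intensity over the hours a job occupies (constant power assumed)."""
    return sum(intensity[start:start + duration])


def best_start(intensity, duration):
    """Pick the start hour that minimises summed intensity for the given duration."""
    starts = range(len(intensity) - duration + 1)
    return min(starts, key=lambda s: window_carbon(intensity, s, duration))


# Hypothetical forecast in gCO2/kWh for seven consecutive hours.
forecast = [300, 100, 400, 200, 150, 150, 500]

predicted_hours = 1   # what a QPP model estimated (assumed value)
actual_hours = 2      # what the query really took (assumed value)

chosen = best_start(forecast, predicted_hours)   # slot picked from the prediction
ideal = best_start(forecast, actual_hours)       # slot an accurate prediction would pick

carbon_chosen = window_carbon(forecast, chosen, actual_hours)
carbon_ideal = window_carbon(forecast, ideal, actual_hours)
print(chosen, ideal, carbon_chosen, carbon_ideal)
```

With these made-up numbers, scheduling from the one-hour prediction places the job at hour 1, where the two-hour actual run spills into the dirtiest neighbouring hour, while an accurate prediction would have chosen hour 4.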

Problem

Research questions and friction points this paper is trying to address.

Lakehouse

Query Performance Prediction

Runtime Variance

Workload Management

Cloud Computing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lakehouse

Query Performance Prediction

Runtime Variance