🤖 AI Summary
This paper investigates whether autoscaling in serverless Spark environments can improve resource utilization efficiency under fixed hardware constraints, particularly rigid node-level memory-to-CPU ratios. Leveraging fine-grained execution logs from large-scale production Spark batch jobs on Google Dataproc Serverless, we conduct controlled experiments augmented with statistical significance testing and granular resource monitoring. Our analysis, the first at the node level, reveals that current autoscaling mechanisms are constrained by immutable node sizes and static resource allocations and therefore fail to adapt dynamically to workload demands; empirical results show no statistically significant improvement in resource efficiency. The core contribution is the identification and validation of “node-level resource rigidity” as the fundamental bottleneck to resource optimization in serverless Spark. This finding provides empirical evidence to guide the design of next-generation elastic schedulers capable of fine-grained, topology-aware resource orchestration.
📝 Abstract
Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual resource allocations, such as memory and CPU cores, must meet the specific resource requirements of the job. An alternative to selecting a static resource allocation for a job execution is autoscaling, as implemented, for example, by Spark. In this paper, we evaluate the resource efficiency of autoscaling batch data processing jobs based on resource demand, both conceptually and experimentally, by analyzing a new dataset of Spark job executions on Google Dataproc Serverless. In our experimental evaluation, we show that there is no significant resource efficiency gain over static resource allocations. We find that the inherent conceptual limitations of such autoscaling approaches are the inelasticity of node size and the inelasticity of the ratio of memory to CPU cores.
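The two limitations named in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the paper's methodology; it assumes a simplified setting in which an autoscaler can only add or remove whole nodes of one fixed shape, and the names `NodeShape`, `nodes_needed`, and `utilization`, as well as the 4-core/16 GB shape, are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeShape:
    """A fixed node size (hypothetical values, not taken from the paper)."""
    cores: int
    memory_gb: float


def nodes_needed(demand_cores: float, demand_memory_gb: float, shape: NodeShape) -> int:
    """Whole nodes required so that both the CPU and the memory demand fit."""
    by_cores = math.ceil(demand_cores / shape.cores)
    by_memory = math.ceil(demand_memory_gb / shape.memory_gb)
    # The scarcer resource dictates the node count; at least one node is allocated.
    return max(by_cores, by_memory, 1)


def utilization(demand_cores: float, demand_memory_gb: float, shape: NodeShape):
    """Return (node count, CPU utilization, memory utilization) for a given demand."""
    n = nodes_needed(demand_cores, demand_memory_gb, shape)
    cpu_util = demand_cores / (n * shape.cores)
    mem_util = demand_memory_gb / (n * shape.memory_gb)
    return n, cpu_util, mem_util


shape = NodeShape(cores=4, memory_gb=16.0)

# Memory-heavy demand: the fixed memory-to-CPU ratio forces extra, mostly idle cores.
ratio_case = utilization(demand_cores=6, demand_memory_gb=80.0, shape=shape)

# Small demand: the fixed node size forces a full node even though little is used.
size_case = utilization(demand_cores=1, demand_memory_gb=2.0, shape=shape)

for label, (n, cpu, mem) in [("ratio", ratio_case), ("size", size_case)]:
    print(f"{label}: nodes={n}, cpu_util={cpu:.2f}, mem_util={mem:.2f}")
```

In this toy model, scaling the number of nodes up or down does not remove the waste: a demand whose memory-to-CPU ratio differs from the node's ratio leaves one resource underused, and a demand smaller than one node leaves both underused. That is the intuition behind why demand-based autoscaling alone may not beat a well-chosen static allocation.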