Flora: Efficient Cloud Resource Selection for Big Data Processing via Job Classification

📅 2025-02-28
🤖 AI Summary
To address the difficulty of choosing cloud resources for distributed dataflow jobs (e.g., Spark and Flink) in public clouds, where configuration options are plentiful and current prices can fluctuate, this paper proposes Flora, a low-overhead approach to cost-optimizing cluster configurations. Users categorize a job according to its data access pattern; Flora then derives a suitable cluster resource configuration (such as memory and CPU cores) from executions of test jobs in the same category, taking current resource costs into account. Evaluated on a new dataset comprising 180 Spark job executions on Google Cloud, Flora's selections deviate from the most cost-optimal configuration by less than 6% on average and by less than 24% in the worst case.

📝 Abstract
Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters of cloud resources. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual resource allocations, such as memory and CPU cores, must meet the specific resource demands of the job. Meanwhile, the choices of cloud configurations are often plentiful, especially in public clouds, and the current cost of the available resource options can fluctuate. Addressing this challenge, we present Flora, a low-overhead approach to cost-optimizing cloud cluster configurations for big data processing. Flora lets users categorize jobs according to their data access patterns and derives suitable cluster resource configurations from executions of test jobs of the same category, considering current resource costs. In our evaluation on a new dataset comprising 180 Spark job executions on Google Cloud, Flora's cluster resource selections exhibit an average deviation below 6% from the most cost-optimal solution, with a maximum deviation below 24%.
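
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not Flora's implementation: the category names, configuration names, runtimes, and the rule "estimated cost = mean test-job runtime × current hourly price" are all assumptions made for illustration, since the abstract does not specify them.

```python
from statistics import mean

# Hypothetical test-job executions: (job category, configuration, runtime in hours).
# Categories and configuration names are placeholders, not taken from the paper.
TEST_RUNS = [
    ("category_a", "config_small", 1.0),
    ("category_a", "config_small", 1.2),
    ("category_a", "config_large", 0.9),
    ("category_b", "config_small", 2.0),
    ("category_b", "config_large", 1.0),
]


def select_config(category, test_runs, hourly_price):
    """Return (config, estimated_cost) with the lowest estimated cost.

    Assumed cost model: mean runtime of same-category test jobs on a
    configuration, multiplied by that configuration's current hourly price.
    """
    runtimes = {}
    for run_category, config, hours in test_runs:
        # Only use test jobs of the same category on currently priced configs.
        if run_category == category and config in hourly_price:
            runtimes.setdefault(config, []).append(hours)
    if not runtimes:
        raise ValueError(f"no usable test runs for category {category!r}")
    costs = {cfg: mean(hrs) * hourly_price[cfg] for cfg, hrs in runtimes.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


if __name__ == "__main__":
    # The same job category can map to a different configuration once prices move.
    print(select_config("category_a", TEST_RUNS, {"config_small": 1.0, "config_large": 1.5}))
    print(select_config("category_a", TEST_RUNS, {"config_small": 2.0, "config_large": 1.5}))
```

Passing the price table as an argument, rather than baking it into the profiling data, reflects the abstract's point that the current cost of resource options can fluctuate while earlier test-job executions remain reusable.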
Problem

Research questions and friction points this paper is trying to address.

Optimizing cloud resource selection for big data processing
Classifying jobs by data access patterns for efficient execution
Minimizing cost deviations in cloud cluster configurations
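
As a rough illustration of the cost-deviation measure mentioned above, the following Python sketch computes average and maximum deviation from the cheapest option. It assumes deviation means the relative extra cost of the selected configuration over the cost-optimal one; the abstract does not give the exact formula, and the function names and sample numbers are hypothetical.

```python
def cost_deviation(selected_cost, optimal_cost):
    """Relative extra cost of the selected configuration (0.0 means optimal).

    Assumed definition: (selected - optimal) / optimal.
    """
    if optimal_cost <= 0:
        raise ValueError("optimal_cost must be positive")
    return (selected_cost - optimal_cost) / optimal_cost


def summarize_deviations(pairs):
    """Return (average, maximum) deviation over (selected_cost, optimal_cost) pairs."""
    deviations = [cost_deviation(selected, optimal) for selected, optimal in pairs]
    return sum(deviations) / len(deviations), max(deviations)


if __name__ == "__main__":
    # Made-up costs for three jobs; these are not results from the paper.
    average, worst = summarize_deviations([(10.0, 10.0), (11.0, 10.0), (12.0, 10.0)])
    print(f"average deviation: {average:.1%}, maximum deviation: {worst:.1%}")
```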
Innovation

Methods, ideas, or system contributions that make the work stand out.

Job classification based on data access patterns
Cost-optimized cloud cluster configurations
Low-overhead test job execution analysis