Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

📅 2025-10-24
🤖 AI Summary
Data-intensive applications in distributed cloud environments slow down under network congestion, asymmetric bandwidth, and cross-node data shuffling, factors that conventional schedulers built around host resources (e.g., CPU and memory) do not capture. To address this, we propose a network-aware scheduling framework for multi-site clusters that is driven by supervised learning. Our approach combines real-time Kubernetes node telemetry with FABRIC’s programmable network topology to train a model that predicts Spark job execution time on each candidate node, then ranks the nodes to choose a placement. The novelty lies in demonstrating supervised learning for real-time, network-aware scheduling on a geographically distributed cluster. In our evaluation, the scheduler achieves 34–54% higher accuracy in selecting the optimal node than the default Kubernetes scheduler.

📝 Abstract
Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level metrics like CPU or memory. Scheduling without accounting for these conditions can lead to poor placement decisions, longer data transfers, and suboptimal job performance. We present a network-aware job scheduler that uses supervised learning to predict the completion time of a job on each candidate node. Our system introduces a prediction-and-ranking mechanism that collects real-time telemetry from all nodes, uses a trained supervised model to estimate the job's duration on each node, and ranks the nodes to select the best placement. We evaluate the scheduler on a geo-distributed Kubernetes cluster deployed on the FABRIC testbed by running network-intensive Spark workloads. Compared to the default Kubernetes scheduler, which makes placement decisions based on current resource availability alone, our supervised scheduler achieved 34–54% higher accuracy in selecting optimal nodes for job placement. The novelty of our work lies in the demonstration of supervised learning for real-time, network-aware job scheduling on a multi-site cluster.
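
The prediction-and-ranking step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `NodeTelemetry` fields, the `toy_predictor` weights, the assumed 1000 MB shuffle size, and the node values are all hypothetical, and the hand-written predictor merely stands in for whatever trained model the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class NodeTelemetry:
    """Hypothetical per-node snapshot; field names are illustrative only."""
    name: str
    cpu_util: float        # fraction of CPU in use, 0.0 to 1.0
    mem_util: float        # fraction of memory in use, 0.0 to 1.0
    rtt_ms: float          # round-trip time to the data source
    bandwidth_mbps: float  # available bandwidth to the data source


def rank_nodes(
    nodes: List[NodeTelemetry],
    predict: Callable[[NodeTelemetry], float],
) -> List[Tuple[str, float]]:
    """Score every node with the predictor and sort by shortest predicted time."""
    scored = [(node.name, predict(node)) for node in nodes]
    return sorted(scored, key=lambda pair: pair[1])


def toy_predictor(node: NodeTelemetry) -> float:
    """Stand-in for a trained model: hand-picked weights, not learned ones."""
    transfer_s = 8000.0 / node.bandwidth_mbps  # assumed 1000 MB shuffle = 8000 Mb
    compute_s = 30.0 * (1.0 + node.cpu_util + 0.5 * node.mem_util)
    return transfer_s + compute_s + 0.05 * node.rtt_ms


# Made-up telemetry for three sites.
NODES = [
    NodeTelemetry("site-a", cpu_util=0.2, mem_util=0.3, rtt_ms=5.0, bandwidth_mbps=100.0),
    NodeTelemetry("site-b", cpu_util=0.1, mem_util=0.1, rtt_ms=60.0, bandwidth_mbps=50.0),
    NodeTelemetry("site-c", cpu_util=0.7, mem_util=0.6, rtt_ms=2.0, bandwidth_mbps=1000.0),
]

for name, seconds in rank_nodes(NODES, toy_predictor):
    print(f"{name}: predicted {seconds:.1f} s")
```

With these made-up numbers, site-c ranks first even though it has the highest CPU utilization, and site-b ranks last even though it is the least loaded. A scheduler that looks only at host resources would tend to prefer site-b, which is the kind of placement the abstract argues against.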

Problem

Research questions and friction points this paper is trying to address.

Scheduling data-intensive workloads without network awareness
Traditional schedulers ignore network congestion and asymmetric bandwidth
Poor job placement decisions due to missing network metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses supervised learning for job completion prediction
Implements prediction-and-ranking mechanism with real-time telemetry
Demonstrates network-aware scheduling on multi-site clusters
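
As a rough illustration of supervised learning for job completion prediction, the sketch below fits nothing more than a k-nearest-neighbour regressor over past runs. The summary does not say which model family the paper uses, so k-NN is a swapped-in stand-in chosen for brevity. The feature layout (CPU utilization, normalized RTT, normalized bandwidth, each scaled to 0 to 1) and every value in `HISTORY` are invented for the example.

```python
import math
from typing import List, Sequence, Tuple

# Each record pairs a feature vector observed at submission time with the
# job duration (seconds) measured afterwards. All values are made up.
# Assumed feature order: (cpu_util, normalized_rtt, normalized_bandwidth).
Record = Tuple[Tuple[float, ...], float]

HISTORY: List[Record] = [
    ((0.2, 0.05, 0.9), 70.0),
    ((0.3, 0.10, 0.8), 80.0),
    ((0.2, 0.60, 0.1), 190.0),
    ((0.1, 0.70, 0.2), 180.0),
    ((0.8, 0.05, 0.9), 110.0),
    ((0.9, 0.10, 0.8), 120.0),
]


def knn_predict(history: List[Record], features: Sequence[float], k: int = 2) -> float:
    """Predict duration as the mean over the k most similar past runs."""
    if not 1 <= k <= len(history):
        raise ValueError("k must be between 1 and the number of records")
    # Sort past runs by Euclidean distance to the query feature vector.
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], features))[:k]
    return sum(duration for _, duration in nearest) / k


well_connected = knn_predict(HISTORY, (0.25, 0.08, 0.85))
poorly_connected = knn_predict(HISTORY, (0.15, 0.65, 0.15))
print(well_connected, poorly_connected)
```

The labels here are measured completion times, which is what makes the setup supervised: each past placement supplies a (telemetry, duration) pair, and the model is queried once per candidate node before ranking.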