"Two-Stagification": Job Dispatching in Large-Scale Clusters via a Two-Stage Architecture

📅 2025-05-05
🤖 AI Summary
To optimize average response time in large-scale clusters of FCFS servers, this paper proposes a lightweight two-stage architecture: jobs are partitioned by an optimized service-time threshold, with short jobs dispatched via JIQ or LWL and long jobs via Round Robin (RR). The design obtains the benefit of size-aware dispatching from the architecture itself rather than from added complexity at the dispatcher, which reduces implementation overhead. Evaluations on synthetic Weibull workloads and Google cluster traces show that the approach significantly outperforms the corresponding single-stage baselines across diverse load conditions and closely approaches the performance of state-of-the-art size- and state-aware dispatchers. The core contribution is empirical evidence that architectural job partitioning, rather than dispatcher complexity, is an efficient route to near-optimal dispatching performance.

📝 Abstract
A continuing effort is devoted to devising effective dispatching policies for clusters of First Come First Served servers. Although the optimal solution for dispatchers aware of both job size and server state remains elusive, lower bounds and strong heuristics are known. In this paper, we introduce a two-stage cluster architecture that applies classical Round Robin, Join Idle Queue, and Least Work Left dispatching schemes, coupled with an optimized service-time threshold to separate large jobs from shorter ones. Using both synthetic (Weibull) workloads and real Google data center traces, we demonstrate that our two-stage approach greatly improves upon the corresponding single-stage policies and closely approaches the performance of advanced size- and state-aware methods. Our results highlight that careful architectural design, rather than increased complexity at the dispatcher, can yield significantly better mean response times in large-scale computing environments.
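The threshold-based split described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that jobs with size at or below the threshold go to one pool of FCFS servers dispatched with Least Work Left, that larger jobs go to a second pool dispatched with Round Robin, and that arrivals are Poisson. The function names `make_jobs` and `simulate` and all parameters are hypothetical.

```python
import random

def make_jobs(n, rate, scale, shape, seed=0):
    """Generate n jobs as (arrival_time, size) pairs.
    Assumption: Poisson arrivals with the given rate, Weibull job sizes."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)                         # exponential inter-arrival gap
        jobs.append((t, rng.weibullvariate(scale, shape)))  # (scale, shape) order
    return jobs

def simulate(jobs, n_short, n_long, threshold):
    """Mean response time of a two-pool cluster of FCFS servers.
    Assumed routing: size <= threshold -> Least Work Left over n_short servers,
    otherwise Round Robin over n_long servers."""
    short_free = [0.0] * n_short   # time at which each short-pool server drains
    long_free = [0.0] * n_long     # same for the long-pool servers
    rr_next = 0
    total = 0.0
    for t, size in jobs:
        if size <= threshold:
            pool = short_free
            # Earliest-draining server has the least unfinished work.
            i = min(range(n_short), key=pool.__getitem__)
        else:
            pool = long_free
            i = rr_next
            rr_next = (rr_next + 1) % n_long
        start = max(t, pool[i])    # FCFS: wait for the backlog to clear
        pool[i] = start + size
        total += pool[i] - t       # response time = departure - arrival
    return total / len(jobs)
```

Setting `threshold=float("inf")` with `n_long=0` reduces the sketch to a single-stage Least Work Left cluster, so the same function can be used to compare the two configurations on identical job lists, or to sweep candidate thresholds and keep the one with the lowest mean response time.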
Problem

Research questions and friction points this paper is trying to address.

Improving job dispatching in large-scale clusters
Separating large and short jobs effectively
Reducing mean response times with architectural design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage cluster architecture for job dispatching
Combines Round Robin, Join Idle Queue, and Least Work Left dispatching
Optimized service-time threshold separates large and small jobs
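For readers unfamiliar with the three classical policies named above, they can be sketched as small server-selection functions. This is an illustrative sketch only: the function names, the representation of server state as plain lists, and the tie-breaking rules (lowest index first, random fallback when no server is idle) are assumptions, not details taken from the paper.

```python
import itertools
import random

def round_robin(n_servers):
    """Return a picker that cycles through server indices, ignoring server state."""
    counter = itertools.cycle(range(n_servers))
    return lambda: next(counter)

def join_idle_queue(busy, rng=random):
    """Pick an idle server if one exists (lowest index here, an assumption);
    otherwise fall back to a uniformly random server."""
    idle = [i for i, is_busy in enumerate(busy) if not is_busy]
    return idle[0] if idle else rng.randrange(len(busy))

def least_work_left(work_left):
    """Pick the server with the smallest amount of unfinished work."""
    return min(range(len(work_left)), key=work_left.__getitem__)
```

The sketch makes the information requirements visible: Round Robin needs no server state, Join Idle Queue needs only a busy/idle flag per server, and Least Work Left needs the unfinished work at every server.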
Mert Yildiz
Dept. of Information Engineering, Electronics, and Telecommunications (DIET), University of Rome Sapienza, Italy
Alexey Rolich
Dept. of Information Engineering, Electronics, and Telecommunications (DIET), University of Rome Sapienza, Italy
Andrea Baiocchi
University of Roma Sapienza - DIET
Networking, network traffic engineering, performance evaluation