A Practical Cross-Layer Approach for ML-Driven Storage Placement in Warehouse-Scale Computers

📅 2025-01-10
🤖 AI Summary
Problem: Machine learning-driven data placement has seen few practical deployments in hyperscale data centers, in part because prior work assumes a monolithic model that resides entirely within the storage layer, which is unrealistic in real-world deployments. Method: This paper proposes a cross-layer co-design for data placement: small, interpretable ML models in the application layer produce predictions, and a co-designed scheduling heuristic in the storage layer consumes them and adapts to different online environments. The authors build a proof-of-concept in a production distributed computation framework at Google and evaluate it in a test deployment and in large-scale simulation studies driven by production traces. Contribution/Results: The approach improves TCO savings by as much as 3.47× over state-of-the-art baselines.

📝 Abstract
Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on the overall system's efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data center deployments at Google, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world data center deployments. We propose a cross-layer approach that moves ML out of the storage system and performs it in the application running on top of it, co-designed with a scheduling algorithm at the storage layer that consumes predictions from these application-level models. This approach combines small, interpretable models with a co-designed heuristic that adapts to different online environments. We build a proof-of-concept of this approach in a production distributed computation framework at Google. Evaluations in a test deployment and large-scale simulation studies using production traces show improvements of as much as 3.47x in TCO savings compared to state-of-the-art baselines. We believe this work represents a significant step towards more practical ML-driven storage placement in warehouse-scale computers.
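The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the hint format, the I/O-density buckets, the SSD/HDD tiers, the utilization cut-offs, and every name below (`PlacementHint`, `application_layer_hint`, `StorageLayerScheduler`) are assumptions made up for illustration. The only idea taken from the abstract is that an application-level model emits a prediction and a storage-layer heuristic consumes it while adapting to current conditions.

```python
from dataclasses import dataclass


@dataclass
class PlacementHint:
    job_id: str
    category: int  # hypothetical: higher means larger predicted benefit from fast storage


def application_layer_hint(job_id: str, bytes_written: float, lifetime_s: float) -> PlacementHint:
    """Toy stand-in for a small application-level model (assumed, not from the paper).

    Buckets a job by I/O density (bytes per second of lifetime) into categories 0..3.
    """
    density = bytes_written / max(lifetime_s, 1.0)
    thresholds = [1e3, 1e5, 1e7]  # illustrative bucket boundaries
    category = sum(density >= t for t in thresholds)
    return PlacementHint(job_id, category)


class StorageLayerScheduler:
    """Hypothetical storage-layer heuristic that consumes application hints.

    Admits data to the fast tier when the hinted category meets an admission
    threshold, and moves that threshold up or down with fast-tier utilization.
    """

    def __init__(self, ssd_capacity: float, num_categories: int = 4):
        self.capacity = ssd_capacity
        self.used = 0.0
        self.threshold = 0  # start by admitting every category
        self.max_category = num_categories - 1

    def place(self, hint: PlacementHint, size: float) -> str:
        utilization = self.used / self.capacity
        # Adapt: become stricter when nearly full, more lenient when mostly empty.
        if utilization > 0.9 and self.threshold < self.max_category:
            self.threshold += 1
        elif utilization < 0.5 and self.threshold > 0:
            self.threshold -= 1
        if hint.category >= self.threshold and self.used + size <= self.capacity:
            self.used += size
            return "SSD"
        return "HDD"

    def release(self, size: float) -> None:
        """Return fast-tier space when data is deleted."""
        self.used = max(0.0, self.used - size)
```

In this sketch the storage layer never sees job features or model internals, only a small integer category, which is one way to keep the interface between the two layers narrow.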
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Data Placement Optimization
Large-scale Data Centers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine Learning Integration
Data Layout Optimization
Storage Cost Reduction
Chenxi Yang
UT Austin. Work was done while at Google.
Yan Li
Google
Martin Maas
Google DeepMind
Mustafa Uysal
Google
Storage systems, operating systems
Ubaid Ullah Hafeez
Google
Arif Merchant
Research Scientist, Google
Storage systems, distributed systems, performance evaluation
Richard McDougall
Google