FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline

📅 2025-07-14
🤖 AI Summary
To address inefficient and memory-intensive metadata caching in DFS clients during deep learning training, this paper proposes FalconFS, a distributed file system with a stateless client architecture. Its core innovations are: (1) server-side hybrid metadata indexing and lazy namespace replication for efficient path resolution; and (2) concurrent request merging, VFS-based rapid deployment, and a lightweight client design that eliminates reliance on client-side metadata caching. Evaluation shows that FalconFS achieves up to 5.72× higher small-file I/O throughput and up to 12.81× higher end-to-end model training throughput than CephFS and Lustre. FalconFS has also run stably for one year in Huawei's production autonomous driving environment with 10,000 NPUs, validating its scalability and industrial viability for large-scale AI workloads.

📝 Abstract
Client-side metadata caching has long been considered an effective method for accelerating metadata operations in distributed file systems (DFSs). However, we have found that client-side state (e.g., caching) is not only ineffective but also consumes valuable memory resources in deep learning pipelines. We thus propose FalconFS, a DFS optimized for deep learning pipelines with a stateless-client architecture. Specifically, instead of performing client-side path resolution and caching, FalconFS efficiently resolves paths on the server side using hybrid metadata indexing and lazy namespace replication. FalconFS also boosts server concurrency with concurrent request merging and provides easy deployment with VFS shortcut. Evaluations against CephFS and Lustre show that FalconFS achieves up to 5.72× throughput for small file read/write and up to 12.81× throughput for deep learning model training. FalconFS has been running in Huawei autonomous driving system's production environment with 10,000 NPUs for one year.
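The abstract's hybrid metadata indexing plus namespace replication can be pictured as: directory entries are replicated across metadata servers while file metadata is partitioned by a hash of the file name, so any single server can resolve a full path locally, with no client-side cache. Below is a minimal sketch of that general idea only; all class and method names are hypothetical, not FalconFS APIs, and replication is shown eagerly where the paper does it lazily.

```python
import hashlib

class MetadataCluster:
    """Illustrative sketch (not FalconFS code): directories are
    replicated to every metadata server; file metadata is partitioned
    by filename hash. A stateless client routes a whole-path lookup
    to one server, which resolves it without client-side caching."""

    def __init__(self, n_servers):
        self.n = n_servers
        self.dirs = [{"/"} for _ in range(n_servers)]    # replicated namespace
        self.files = [dict() for _ in range(n_servers)]  # hash-partitioned files

    def _home(self, name):
        # Which server owns a file, derived from the name alone.
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % self.n

    def mkdir(self, path):
        # Replicate the directory entry to all servers (paper: lazily).
        for d in self.dirs:
            d.add(path)

    def create(self, path, inode):
        srv = self._home(path.rsplit("/", 1)[-1])
        self.files[srv][path] = inode

    def lookup(self, path):
        parent, name = path.rsplit("/", 1)
        srv = self._home(name)  # routable by filename hash alone
        if (parent or "/") not in self.dirs[srv]:
            raise FileNotFoundError(parent)
        return self.files[srv][path]
```

Because the owning server holds a replica of the directory namespace, it validates the parent path and returns the file's metadata in one hop, which is the property the paper exploits to drop client-side path resolution.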
Problem

Research questions and friction points this paper is trying to address.

Optimizes distributed file systems for deep learning pipelines
Eliminates ineffective client-side metadata caching in DFS
Enhances server-side path resolution and concurrency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateless-client architecture for deep learning
Hybrid metadata indexing on server side
Concurrent request merging for server concurrency
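The "concurrent request merging" bullet resembles the well-known singleflight pattern: when many client threads issue the same metadata request concurrently, the server executes it once and fans the result out to all waiters. A minimal sketch of that pattern follows; the class and method names are mine, not FalconFS's, and the paper's actual merging mechanism may differ.

```python
import threading

class RequestMerger:
    """Singleflight-style request merging sketch (illustrative only):
    concurrent calls for the same key share one backend execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> {"event": Event, "result": ...}

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is None:
                # First caller becomes the leader and runs fn.
                entry = {"event": threading.Event(), "result": None}
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        if leader:
            try:
                entry["result"] = fn()
            finally:
                with self._lock:
                    del self._inflight[key]  # later callers start fresh
                entry["event"].set()         # wake all merged waiters
            return entry["result"]
        entry["event"].wait()
        return entry["result"]
```

Go's golang.org/x/sync/singleflight package implements the same idea; the benefit here is that N concurrent identical lookups cost one server-side execution instead of N.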
Jingwei Xu
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Junbin Kang
Huawei Technologies
Mingkai Dong
Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (SJTU)
Mingyu Liu
Technical University of Munich
Lu Zhang
Huawei Technologies
Shaohong Guo
Huawei Technologies
Ziyan Qiu
Huawei Technologies
Mingzhen You
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Ziyi Tian
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Anqi Yu
Huawei Technologies
Tianhong Ding
Huawei Technologies
Xinwei Hu
Huawei Technologies
Haibo Chen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University