Speeding up Model Loading with fastsafetensors

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained model parameter counts have surged, exposing inefficiencies in loading the safetensors format: existing approaches deserialize parameters sequentially into host memory before copying them to the device, wasting both I/O and memory bandwidth. This work introduces an end-to-end device-direct loading mechanism: on-disk parameters are grouped, transferred by DMA directly into GPU memory, and instantiated as tensors in place, eliminating host-memory staging entirely. The method supports peer-to-peer DMA, CUDA memory mapping, GPU-accelerated offloading, and multithreaded asynchronous I/O. Evaluated on Llama (7B/13B/70B), Falcon (40B), and Bloom (176B), it loads models 4.8x to 7.5x faster than baseline methods, significantly improving large-model deployment efficiency.

📝 Abstract
The rapid increase in model parameter sizes introduces new challenges in pre-trained model loading. Currently, machine learning code often deserializes each parameter as a tensor object in host memory before copying it to device memory. We found that this approach underutilizes storage throughput and significantly slows down loading large models with a widely used model file format, safetensors. In this work, we present fastsafetensors, a Python library designed to optimize the deserialization of tensors in safetensors files. Our approach first copies groups of on-disk parameters to device memory, where they are directly instantiated as tensor objects. This design enables further optimization in low-level I/O and high-level tensor preprocessing, including parallelized copying, peer-to-peer DMA, and GPU offloading. Experimental results show performance improvements of 4.8x to 7.5x in loading models such as Llama (7, 13, and 70 billion parameters), Falcon (40 billion parameters), and Bloom (176 billion parameters).
Problem

Research questions and friction points this paper is trying to address.

Slow deserialization of tensors in safetensors files
Long loading times for models with large parameter counts
Underutilized storage throughput and I/O inefficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct tensor instantiation in device memory
Parallelized copying and GPU offloading
Optimized low-level I/O for safetensors
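The grouped-copy idea above rests on the safetensors on-disk layout: an 8-byte little-endian header length, a JSON header mapping each tensor name to its dtype, shape, and byte offsets, then one contiguous byte buffer. Because every tensor's bytes sit in that single buffer, a loader can compute one span covering a whole group of parameters and issue a single bulk transfer instead of per-tensor host-side deserialization. The sketch below is a minimal, stdlib-only illustration of that layout and the grouped byte range; it is not the fastsafetensors API, and `build_safetensors`, `parse_header`, and `grouped_range` are illustrative names.

```python
import json
import struct


def build_safetensors(tensors):
    # tensors: name -> raw little-endian bytes. Builds a minimal safetensors
    # blob: 8-byte LE header length, JSON header, contiguous byte buffer.
    header, buf, off = {}, b"", 0
    for name, data in tensors.items():
        header[name] = {
            "dtype": "F32",
            "shape": [len(data) // 4],
            "data_offsets": [off, off + len(data)],  # relative to buffer start
        }
        buf += data
        off += len(data)
    hjson = json.dumps(header).encode()
    return struct.pack("<Q", len(hjson)) + hjson + buf


def parse_header(blob):
    # Returns the header dict and the absolute offset of the byte buffer.
    (hlen,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + hlen]), 8 + hlen


def grouped_range(header, base):
    # One contiguous [start, end) span covering all tensors: a single bulk
    # read of this span can replace per-tensor copies through host memory.
    begins = [m["data_offsets"][0] for m in header.values()]
    ends = [m["data_offsets"][1] for m in header.values()]
    return base + min(begins), base + max(ends)
```

In a device-direct design, the bytes at `grouped_range` would be DMA-copied to GPU memory in one operation, and tensor objects would then be constructed in place from the offsets recorded in the header.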
👥 Authors
Takeshi Yoshimura
IBM Research - Tokyo, Tokyo, Japan
Tatsuhiro Chiba
IBM Research - Tokyo, Tokyo, Japan
Manish Sethi
IBM Research, Durham, USA
Daniel Waddington
IBM Research, Almaden
multicore, persistent memory, high-performance distributed computing
S. Sundararaman
IBM Research, San Jose, USA