Speeding up Model Loading with fastsafetensors

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained model parameter counts have surged, exposing inefficiencies in loading the safetensors format: existing approaches deserialize parameters sequentially into host memory before copying them to the device, wasting both I/O and memory bandwidth. This work introduces an end-to-end device-direct loading mechanism: on-disk parameters are grouped, transferred by DMA directly into GPU memory, and instantiated as tensors in place, eliminating host-memory staging entirely. The method supports peer-to-peer DMA, CUDA memory mapping, GPU-accelerated offloading, and multithreaded asynchronous I/O. Evaluated on Llama (7B/13B/70B), Falcon (40B), and Bloom (176B), it loads models 4.8x to 7.5x faster than baseline methods, significantly improving large-model deployment efficiency.

📝 Abstract
The rapid increase in model parameter sizes introduces new challenges in pre-trained model loading. Currently, machine learning code often deserializes each parameter as a tensor object in host memory before copying it to device memory. We found that this approach underutilizes storage throughput and significantly slows down loading large models with a widely used model file format, safetensors. In this work, we present fastsafetensors, a Python library designed to optimize the deserialization of tensors in safetensors files. Our approach first copies groups of on-disk parameters to device memory, where they are directly instantiated as tensor objects. This design enables further optimization in low-level I/O and high-level tensor preprocessing, including parallelized copying, peer-to-peer DMA, and GPU offloading. Experimental results show performance improvements of 4.8x to 7.5x in loading models such as Llama (7, 13, and 70 billion parameters), Falcon (40 billion parameters), and Bloom (176 billion parameters).
Problem

Research questions and friction points this paper is trying to address.

Slow deserialization of tensors in safetensors files
Long loading times for models with large parameter counts
Underutilized storage throughput and I/O inefficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct tensor instantiation in device memory
Parallelized copying and GPU offloading
Optimized low-level I/O for safetensors
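The grouped-copy idea above rests on the safetensors on-disk layout: an 8-byte little-endian header length, a JSON header mapping each tensor name to its dtype, shape, and byte offsets, then one contiguous byte buffer. Because every tensor's bytes sit in that single buffer, a loader can compute one span covering a whole group of parameters and issue a single bulk transfer instead of per-tensor host-side deserialization. The sketch below is a minimal, stdlib-only illustration of that layout and the grouped byte range; it is not the fastsafetensors API, and `build_safetensors`, `parse_header`, and `grouped_range` are illustrative names.

```python
import json
import struct


def build_safetensors(tensors):
    # tensors: name -> raw little-endian bytes. Builds a minimal safetensors
    # blob: 8-byte LE header length, JSON header, contiguous byte buffer.
    header, buf, off = {}, b"", 0
    for name, data in tensors.items():
        header[name] = {
            "dtype": "F32",
            "shape": [len(data) // 4],
            "data_offsets": [off, off + len(data)],  # relative to buffer start
        }
        buf += data
        off += len(data)
    hjson = json.dumps(header).encode()
    return struct.pack("<Q", len(hjson)) + hjson + buf


def parse_header(blob):
    # Returns the header dict and the absolute offset of the byte buffer.
    (hlen,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + hlen]), 8 + hlen


def grouped_range(header, base):
    # One contiguous [start, end) span covering all tensors: a single bulk
    # read of this span can replace per-tensor copies through host memory.
    begins = [m["data_offsets"][0] for m in header.values()]
    ends = [m["data_offsets"][1] for m in header.values()]
    return base + min(begins), base + max(ends)
```

In a device-direct design, the bytes at `grouped_range` would be DMA-copied to GPU memory in one operation, and tensor objects would then be constructed in place from the offsets recorded in the header.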
👥 Authors
Takeshi Yoshimura
IBM Research - Tokyo, Tokyo, Japan
Tatsuhiro Chiba
IBM Research - Tokyo, Tokyo, Japan
Manish Sethi
IBM Research, Durham, USA
Daniel Waddington
IBM Research, Almaden
multicore, persistent memory, high-performance distributed computing
S. Sundararaman
IBM Research, San Jose, USA