TorchAO: PyTorch-Native Training-to-Serving Model Optimization

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address fragmented model optimization workflows and poor cross-stack compatibility in the PyTorch ecosystem, this paper introduces TorchAO, a PyTorch-native, end-to-end model optimization framework. It establishes a unified abstraction for low-precision tensors and supports FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, thereby closing the loop across pretraining, fine-tuning, quantization, and deployment. A novel tensor subclass design enables backend-agnostic low-precision data types and deep integration with major toolchains, including TorchTitan, TorchTune, vLLM, and Hugging Face Transformers. The framework has been applied to the quantized Llama 3.2 1B/3B and LlamaGuard3-8B model releases, demonstrating practical efficacy, and the open-source implementation is publicly available on GitHub.
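The post-training quantization the summary mentions ultimately reduces to mapping float tensors onto a low-bit integer grid with a scale factor. A minimal, dependency-free sketch of symmetric per-tensor int8 quantization (illustrative only; the function names are hypothetical and TorchAO's actual APIs and kernels differ):

```python
def quantize_int8(values):
    # Symmetric per-tensor scheme: the scale maps the largest magnitude to 127.
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float values; round-off error is bounded by scale / 2.
    return [qi * scale for qi in q]

weights = [0.1, -0.5, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Storing `q` as int8 plus one float scale is what yields the ~4x memory saving over float32 weights, at the cost of the bounded rounding error visible in `restored`.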

📝 Abstract
We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao/.
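Of the techniques the abstract lists, quantization-aware training is the least self-explanatory: during training, values are "fake quantized" (rounded onto the low-precision grid and immediately dequantized) so the loss sees quantization error while the computation stays in float. A dependency-free sketch of that round-trip, under the assumption of a symmetric scheme (the function name is illustrative, not TorchAO's API):

```python
def fake_quantize(values, num_bits=8):
    # QAT "fake quantization": round-trip through the low-precision grid so
    # training sees quantization error while values remain floats.
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8, 7 for int4
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

out = fake_quantize([0.3, -1.2, 0.7], num_bits=4)
```

Because the forward pass already contains the rounding that inference will apply, the fine-tuned weights adapt to it, which is why QAT typically recovers accuracy that plain post-training quantization loses at 4 bits.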
Problem

Research questions and friction points this paper is trying to address.

Model optimization tooling in the PyTorch ecosystem is fragmented, with poor compatibility across the training and serving stacks
Diverse techniques such as FP8 training, QAT, PTQ, and 2:4 sparsity lack a shared representation for low-precision tensors
No single workflow carries an optimized model from pre-training through fine-tuning to deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

PyTorch-native framework unifying quantization and sparsity techniques
Novel tensor subclass abstraction representing backend-agnostic low-precision data types (INT4, INT8, FP8, MXFP4/6/8)
End-to-end training-to-serving workflow integrating TorchTitan, TorchTune, Axolotl, Hugging Face, vLLM, SGLang, and ExecuTorch
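The tensor subclass idea above can be illustrated with a toy wrapper: an object that stores int8 data plus a scale and only materializes floats when an operation needs them, accumulating in integers and rescaling once at the end. This is a plain-Python sketch of the concept (the `QuantTensor` class is hypothetical; TorchAO builds on `torch.Tensor` subclasses and real kernels):

```python
class QuantTensor:
    """Toy low-precision tensor wrapper: int8 payload plus one float scale."""

    def __init__(self, data, scale):
        self.data = data      # list of ints in [-128, 127]
        self.scale = scale    # per-tensor float scale

    @classmethod
    def from_float(cls, values):
        # Symmetric quantization, as in standard int8 weight-only schemes.
        amax = max(abs(v) for v in values)
        scale = amax / 127 if amax > 0 else 1.0
        data = [max(-128, min(127, round(v / scale))) for v in values]
        return cls(data, scale)

    def dequantize(self):
        # Materialize floats only when an op actually needs them.
        return [q * self.scale for q in self.data]

    def dot(self, other):
        # Accumulate in integers, rescale once: the pattern int8 kernels use.
        acc = sum(a * b for a, b in zip(self.data, other.data))
        return acc * self.scale * other.scale
```

Because callers only see the wrapper's methods, the same model code can run against int8, FP8, or MX payloads; swapping the backend means swapping the wrapper's internals, which is the interoperability the subclass abstraction buys.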
👥 Authors
Andrew Or (Meta Platforms Inc.)
Apurva Jain (Meta Platforms Inc.)
Daniel Vega-Myhre (Meta Platforms Inc.)
Jesse Cai (Meta)
Charles David Hernandez (Meta Platforms Inc.)
Zhenrui Zheng (Meta Platforms Inc.)
Driss Guessous (Meta Platforms Inc.)
Vasiliy Kuznetsov (Meta Platforms Inc.)
Christian Puhrsch (Meta Platforms Inc.)
Mark Saroufim (Meta)
Supriya Rao (Meta Platforms Inc.)
Thien Tran (Independent)
Aleksandar Samardžić (OpenTeams Inc.)