🤖 AI Summary
Training large language models (LLMs) incurs prohibitive computational costs, hindering both research and deployment. While FP8 training promises substantial savings, no existing framework offers an open-source, end-to-end solution, impeding practical adoption. To address this, we propose the first open-source FP8 training framework supporting both continued pretraining and supervised fine-tuning. Our method introduces a fine-grained mixed-precision quantization strategy that adaptively selects quantization granularities—per-tensor, per-channel, or per-group—for weights, activations, and gradients, balancing numerical stability and hardware efficiency. Evaluated on a 160B-token corpus, our approach reduces training time by up to 22%, lowers peak memory consumption by 14%, and improves throughput by 19% relative to BF16 baselines, while preserving inference accuracy. This work establishes the first efficient, robust, and fully open-source FP8 LLM training pipeline, bridging a critical gap in scalable LLM development.
📝 Abstract
The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including continual pre-training on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
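To make the hybrid-granularity idea concrete, the sketch below simulates FP8 E4M3 scaling at the three granularities named above. This is illustrative only: the function names (`fp8_scales`, `fake_quant`) and the integer-rounding stand-in for the hardware FP8 cast are our own assumptions, not the released recipe's API. The common thread is that each scaling block maps its own max |value| to the FP8 dynamic range, so finer granularity isolates outliers and reduces quantization error.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_scales(x: np.ndarray, granularity: str, group_size: int = 128) -> np.ndarray:
    """Per-block scale factors mapping each block's max |x| onto the FP8 range."""
    if granularity == "per_tensor":
        amax = np.abs(x).max()                       # one scale for the whole tensor
        return np.full_like(x, FP8_E4M3_MAX / max(amax, 1e-12))
    if granularity == "per_channel":
        amax = np.abs(x).max(axis=1, keepdims=True)  # one scale per row (channel)
        return np.broadcast_to(FP8_E4M3_MAX / np.maximum(amax, 1e-12), x.shape)
    if granularity == "per_group":
        rows, cols = x.shape
        g = x.reshape(rows, cols // group_size, group_size)
        amax = np.abs(g).max(axis=2, keepdims=True)  # one scale per contiguous group
        scales = FP8_E4M3_MAX / np.maximum(amax, 1e-12)
        return np.broadcast_to(scales, g.shape).reshape(rows, cols)
    raise ValueError(f"unknown granularity: {granularity}")

def fake_quant(x: np.ndarray, granularity: str, group_size: int = 128) -> np.ndarray:
    """Scale into FP8 range, round+clip (a stand-in for the FP8 cast), rescale back."""
    s = fp8_scales(x, granularity, group_size)
    q = np.clip(np.round(x * s), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q / s
```

A tensor with a single large outlier shows the trade-off: per-tensor scaling stretches one scale over everything, inflating error on the small values, while per-group scaling confines the outlier's influence to its own group — at the cost of storing more scale factors, which is why an adaptive choice per tensor type matters for hardware efficiency.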