Towards Fully FP8 GEMM LLM Training at Scale

📅 2025-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses stable, end-to-end FP8 training for large language models (LLMs). Existing recipes often abandon FP8 in favor of BF16, particularly in sensitive attention projection layers. We propose the first Transformer training architecture that supports FP8 precision for all GEMMs within the transformer blocks in both the forward and backward passes. Our method introduces: (1) a fully FP8 linear layer design that covers all attention projections; (2) a structured outlier-suppression scaling mechanism; and (3) a divergence-prediction metric based on low-precision gradient and activation statistics. Evaluated on a thousand-GPU cluster, our approach enables stable FP8 training on ultra-long sequences while preserving BF16-level downstream task performance, and it improves training throughput without compromising model quality or convergence stability.

📝 Abstract

Despite the significant potential of FP8 data formats for large language model (LLM) pre-training, their adoption has been limited due to challenges in maintaining stability at scale. Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications (GEMMs) in sensitive components, such as attention projections, compromising potential throughput gains. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training. Our architecture design reduces large outlier activations, promoting stable long-term FP8 training. In addition, we identify key metrics to monitor low-precision training and predict potential future divergences.
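To make "FP8 computation for a GEMM" concrete, here is a minimal NumPy sketch of a per-tensor-scaled FP8 matrix multiplication. It is a simulation under stated assumptions, not the paper's kernel: the E4M3 format (maximum finite value 448, 3 stored mantissa bits), the per-tensor scaling recipe, and float32 accumulation are choices made for illustration, and the helper names `quantize_e4m3` and `fp8_gemm` are hypothetical. Subnormals and underflow are ignored.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def quantize_e4m3(x):
    """Simulate per-tensor-scaled E4M3 rounding; returns (values, scale).

    Simplification: exponent range limits and subnormals are not modelled.
    """
    amax = float(np.max(np.abs(x)))
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    mant, exp = np.frexp(scaled)          # scaled = mant * 2**exp, 0.5 <= |mant| < 1
    mant = np.round(mant * 16.0) / 16.0   # keep 4 significant bits (1 implicit + 3 stored)
    return np.ldexp(mant, exp).astype(np.float32), scale


def fp8_gemm(a, b):
    """Quantize both operands, multiply in float32, then undo the scales."""
    qa, sa = quantize_e4m3(a)
    qb, sb = quantize_e4m3(b)
    return (qa @ qb) / (sa * sb)


rng = np.random.default_rng(0)
a = rng.standard_normal((64, 128)).astype(np.float32)
b = rng.standard_normal((128, 32)).astype(np.float32)
ref = a @ b
rel_err = np.linalg.norm(fp8_gemm(a, b) - ref) / np.linalg.norm(ref)
print(f"relative error vs. float32 GEMM: {rel_err:.4f}")
```

Because a single scale is shared by the whole tensor, the scale is set by the largest-magnitude entry. On real hardware, where the exponent range is finite, one large outlier can therefore push the remaining values toward underflow, which is one reason reducing outlier activations matters for FP8 stability.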

Problem

Research questions and friction points this paper is trying to address.

Challenges in maintaining FP8 stability at scale for LLM training

Suboptimal FP8 kernels or fallback to higher-precision GEMMs in sensitive components

Need for FP8-compatible LLM architectures to maximize throughput without performance loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

FP8 computation for all transformer GEMMs

Reduces large outlier activations for stability

Monitors metrics to predict training divergences
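The summary above does not say which statistics are monitored. As an illustrative sketch only, the snippet below tracks kurtosis, a standard measure of how heavy-tailed a tensor's values are, as one plausible outlier indicator. The choice of kurtosis and the function name `kurtosis` are assumptions for this example, not details confirmed by the text.

```python
import numpy as np


def kurtosis(x):
    """Fourth standardized moment; close to 3 for Gaussian-like values."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    var = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / var ** 2)


rng = np.random.default_rng(0)
healthy = rng.standard_normal(10_000)   # stand-in for well-behaved activations
spiky = healthy.copy()
spiky[0] = 100.0                        # inject a single large outlier

print(f"healthy kurtosis: {kurtosis(healthy):.2f}")
print(f"spiky kurtosis:   {kurtosis(spiky):.2f}")
```

In a training loop, a statistic like this could be logged per layer for activations and gradients; a sustained rise would signal growing outliers before the loss itself diverges.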
