Microscaling Floating Point Formats for Large Language Models

📅 2025-10-02
📈 Citations: 0
Influential: 0
📄 PDF

career value

188K/year
🤖 AI Summary
To address the high memory and computational overhead of large language models (LLMs), this paper proposes MicroScale FP—a novel 8-bit floating-point format that employs block-wise shared scale factors instead of per-value scaling, enabling substantial storage compression while preserving wide dynamic range and high numerical precision. The format supports end-to-end training and inference. Evaluated on GPT-2 across multiple configurations, it achieves accuracy close to full-precision baselines, reduces memory footprint by approximately 50%, and significantly lowers computational cost. Its hardware-friendly design facilitates efficient implementation, and open-source code is provided. The core contribution lies in the first systematic integration of block-shared scaling into low-bit floating-point representation—establishing a new paradigm for efficient LLM deployment that jointly optimizes accuracy, dynamic range, and parameter compactness.

Technology Category

Application Category

📝 Abstract
The increasing computational and memory demands of large language models (LLMs) necessitate innovative approaches to optimize resource usage without compromising performance. This paper leverages microscaling floating-point formats, a novel technique designed to address these challenges by reducing the storage and computational overhead associated with numerical representations in LLMs. Unlike traditional floating-point representations that allocate a dedicated scale for each value, microscaling employs a shared scale across a block of values, enabling compact one-byte floating-point representations while maintaining an extended dynamic range. We explore the application of microscaling in the context of 8-bit floating-point formats to significantly reduce memory footprint and computational costs. We tested several configurations of microscaling floats within the GPT-2 LLM architecture, demonstrating that microscaling data formats can achieve competitive accuracy during training and inference, proving its efficacy as a resource-efficient alternative for deploying LLMs at scale. The source code is publicly available at: https://github.com/unipi-dii-compressedarith/llm.c-sve
Problem

Research questions and friction points this paper is trying to address.

Reducing memory footprint of large language models
Optimizing computational costs for LLM deployment
Maintaining accuracy with compact floating-point representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Microscaling uses shared scale across value blocks
Enables compact one-byte floating-point representations
Reduces memory and computational costs in LLMs