🤖 AI Summary
To address the high memory and computational overhead of large language models (LLMs), this paper applies microscaling floating-point formats: 8-bit floating-point representations in which a block of values shares a single scale factor instead of each value carrying its own. This permits compact one-byte storage per value while preserving an extended dynamic range. The formats are used for both training and inference. Evaluated in several configurations on the GPT-2 architecture, they achieve accuracy close to full-precision baselines, reduce memory footprint by approximately 50%, and lower computational cost. The block-shared design is hardware-friendly, and open-source code is provided. The core contribution is an empirical demonstration that block-shared scaling in low-bit floating-point representations is a resource-efficient option for deploying LLMs at scale.
📝 Abstract
The increasing computational and memory demands of large language models (LLMs) necessitate innovative approaches to optimize resource usage without compromising performance. This paper leverages microscaling floating-point formats, a novel technique designed to address these challenges by reducing the storage and computational overhead associated with numerical representations in LLMs. Unlike traditional floating-point representations that allocate a dedicated scale for each value, microscaling employs a shared scale across a block of values, enabling compact one-byte floating-point representations while maintaining an extended dynamic range. We explore the application of microscaling in the context of 8-bit floating-point formats to significantly reduce memory footprint and computational costs. We test several configurations of microscaling floats within the GPT-2 LLM architecture and show that microscaling data formats can achieve competitive accuracy during training and inference, which supports their use as a resource-efficient alternative for deploying LLMs at scale. The source code is publicly available at: https://github.com/unipi-dii-compressedarith/llm.c-sve
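
The shared-scale idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper or its repository: the block size of 32, the power-of-two shared scale, and the E4M3 element layout (4 exponent bits, 3 mantissa bits, maximum value 448) are assumptions borrowed from common microscaling conventions, and all function names are hypothetical. Elements are kept as Python floats that lie on the 8-bit grid rather than being packed into bytes.

```python
import math

# Assumed parameters (not confirmed by the paper): 32 values per block,
# E4M3 elements with 3 mantissa bits, exponent range [-6, 8], max value 448.
BLOCK_SIZE = 32
MANT_BITS = 3
ELEM_EMIN = -6
ELEM_EMAX = 8
ELEM_MAX = 448.0


def round_to_e4m3(x):
    """Round x to the nearest value on the assumed E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), ELEM_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) = ex - 1.
    # Clamping to ELEM_EMIN makes small values fall on the subnormal grid.
    e = max(math.frexp(a)[1] - 1, ELEM_EMIN)
    step = 2.0 ** (e - MANT_BITS)
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * min(round(a / step) * step, ELEM_MAX)


def quantize_block(values):
    """Return (shared_exp, elements): one shared power-of-two exponent per block."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * len(values)
    # Place the largest magnitude in the top binade of the element format.
    shared_exp = (math.frexp(max_abs)[1] - 1) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in values]


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate values from a shared exponent and its elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


def quantize(values):
    """Split a flat list into blocks of BLOCK_SIZE and quantize each block."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


if __name__ == "__main__":
    data = [0.001 * i for i in range(-16, 16)]
    for shared_exp, elems in quantize(data):
        restored = dequantize_block(shared_exp, elems)
        worst = max(abs(a - b) for a, b in zip(data, restored))
        print("shared exponent:", shared_exp, "worst absolute error:", worst)
```

Under these assumptions, a block of 32 values costs 32 element bytes plus one scale byte, or 8.25 bits per value. The dynamic range comes mostly from the shared exponent, so each one-byte element only has to resolve values relative to the largest magnitude in its block.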