MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation

πŸ“… 2025-04-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
In low-bit floating-point (e.g., 8-bit) dot-product accumulation, significant mantissa right-shifting and precision loss occur due to exponent mismatches, leading to dominant rounding errors. Method: This paper proposes an exponent-aware greedy summation scheduling strategy, coupled with Markov modeling to optimize the accumulation pathβ€”without increasing operand bitwidth. It further introduces a lightweight dMAC hardware unit supporting efficient low-precision floating-point accumulation and establishes a theoretical numerical error model to guide design optimization. Contribution/Results: Experiments across multiple image classification benchmarks achieve accuracy parity with 32-bit floating-point baselines. The dMAC unit reduces hardware power consumption by up to 34.1% compared to conventional implementations, demonstrating simultaneous gains in both computational accuracy and energy efficiency.

πŸ“ Abstract
We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lead to loss of precision in the mantissa of the smaller term, which is right-shifted to align with the larger term's exponent. Such shifting (a.k.a. 'swamping') is a significant source of numerical errors in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point), as the mantissa has a small number of bits. We avoid most swamping errors by arranging the terms in dot product summation based on their exponents and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating-point error at inference time for several neural networks. In contrast to traditional sequential summation, our method significantly lowers numerical errors, achieving classification accuracy on par with high-precision floating-point baselines for multiple image classification tasks. Our dMAC hardware units can reduce power consumption by up to 34.1% relative to conventional MAC units.
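The swamping effect, and the intuition behind exponent-aware accumulation, can be sketched in a few lines of Python. This is an illustrative simulation only, not the paper's MGS algorithm or dMAC unit: `round_to_mantissa`, `lowprec_sum`, and `grouped_sum` are hypothetical helpers that model a tiny mantissa by rounding after each add.

```python
import math
from collections import defaultdict

def round_to_mantissa(x, mant_bits):
    """Round x to mant_bits mantissa bits (a crude model of a
    low-precision accumulator, not the paper's dMAC)."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (e - mant_bits)
    return round(x / scale) * scale

def lowprec_sum(terms, mant_bits=3):
    """Conventional sequential summation: round after every add,
    so small terms are 'swamped' by a large running total."""
    acc = 0.0
    for t in terms:
        acc = round_to_mantissa(acc + t, mant_bits)
    return acc

def grouped_sum(terms, mant_bits=3):
    """Exponent-aware sketch: accumulate terms of similar exponent
    together first, then combine small-to-large and round once."""
    buckets = defaultdict(float)
    for t in terms:
        if t != 0.0:
            buckets[math.floor(math.log2(abs(t)))] += t
    total = 0.0
    for e in sorted(buckets):  # small exponents first
        total += buckets[e]
    return round_to_mantissa(total, mant_bits)

terms = [1.0, 0.02, 0.02, 0.02, 0.02]  # exact sum = 1.08
print(lowprec_sum(terms))   # 1.0   -- every 0.02 term is swamped
print(grouped_sum(terms))   # 1.125 -- closest 3-bit value to 1.08
```

With a 3-bit mantissa, each `1.0 + 0.02` rounds straight back to 1.0, so sequential summation loses all four small terms; accumulating the small-exponent terms together before the final add preserves their combined contribution.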
Problem

Research questions and friction points this paper is trying to address.

Improve accuracy of low-bitwidth floating-point dot products
Reduce numerical errors in 8-bit floating point accumulation
Minimize power consumption in neural network computations
Innovation

Methods, ideas, or system contributions that make the work stand out.

MGS arranges terms by exponents for accuracy
Sums mantissas without low-bitwidth overflow
dMAC hardware cuts power by 34.1%
Vikas Natesh
Harvard University, Cambridge, MA, USA
H. T. Kung
Professor, Harvard University
Machine learning accelerators, high-performance computing, computer & wireless networks, complexity, database systems
David Kong
Harvard University, Cambridge, MA, USA