Ablation Study of Block Size, Weight Precision, and Scale Precision in NVFP4 Inference for Low-Power Edge-Efficient Neural Networks

📅 2026-06-02

🤖 AI Summary

This work targets the high compute energy, memory-access overhead, and hardware cost of neural network inference on edge devices, and proposes the NVLUT framework. Building on the NVFP4 format, NVLUT uses two-level scaling (an FP8 block-wise scale and an FP32 tensor-wise scale) to restore activation dynamic range without requiring retraining. Multiplication is decomposed into sign, exponent, and mantissa paths, with the mantissa multiplication replaced by a lookup-table (LUT) access, while voltage-scaled storage and selective ECC protection balance energy efficiency against robustness at ultra-low bitwidths. Block-size ablation shows that B=16 offers a practical accuracy/storage trade-off. NVLUT achieves up to 26.85× energy reduction and up to 2.21× area reduction over traditional LUTs with ECC plus voltage scaling, and FP8 or FP16 weights provide only modest accuracy gains over FP4 weights under the same NVFP4 activation path.
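To make the two-level scaling concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names, the tensor-scale convention (largest magnitude divided by 6 × 448), and the simplified FP8 E4M3 rounding are all assumptions chosen for illustration. The last function counts storage bits and gives 4.5078 bits per input for N = 4096 and B = 16, which matches the figure quoted in the abstract.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
FP4_MAX = 6.0
FP8_MAX = 448.0  # largest finite FP8 E4M3 value


def round_to_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 value (simplified)."""
    if x <= 0.0:
        return 0.0
    x = min(x, FP8_MAX)
    _, e = math.frexp(x)             # x = m * 2**e with 0.5 <= m < 1
    e = max(e - 1, -6)               # clamp to the smallest normal exponent
    step = 2.0 ** (e - 3)            # 3 mantissa bits -> 8 steps per binade
    return min(round(x / step) * step, FP8_MAX)


def nearest_fp4(v):
    """Snap v to the nearest E2M1 value; ties go to the smaller magnitude."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(v)))
    return math.copysign(mag, v)


def quantize_two_level(x, block_size=16):
    """Return (FP4 values, FP8 block scales, FP32 tensor scale)."""
    amax = max(abs(v) for v in x)
    # Assumed convention: the largest block scale lands on FP8_MAX.
    tensor_scale = amax / (FP4_MAX * FP8_MAX) if amax > 0 else 1.0
    codes, block_scales = [], []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        bmax = max(abs(v) for v in block)
        bscale = round_to_e4m3(bmax / (FP4_MAX * tensor_scale))
        block_scales.append(bscale)
        denom = bscale * tensor_scale
        codes.extend(nearest_fp4(v / denom) if denom > 0 else 0.0 for v in block)
    return codes, block_scales, tensor_scale


def dequantize(codes, block_scales, tensor_scale, block_size=16):
    """Reconstruct approximate values: fp4 * block scale * tensor scale."""
    return [c * block_scales[i // block_size] * tensor_scale
            for i, c in enumerate(codes)]


def bits_per_input(n, block_size, data_bits=4, block_bits=8, tensor_bits=32):
    """Average storage per element, counting both scale levels."""
    num_blocks = -(-n // block_size)  # ceiling division
    return (n * data_bits + num_blocks * block_bits + tensor_bits) / n
```

Smaller blocks track local magnitude more closely but pay 8/B extra bits per element for the block scale, which is the accuracy/storage trade-off the block-size ablation explores.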

📝 Abstract

Energy-efficient edge inference requires reducing arithmetic cost, memory traffic, and hardware overhead. This paper presents an ablation-focused study of NVFP4 LUT-based inference for edge-efficient neural networks. The proposed NVLUT framework combines 4-bit NVFP4 activations, two-level scaling, LUT-based mantissa computation, voltage-scaled storage, and selective ECC protection. Multiplication is decomposed into sign, exponent, and mantissa paths, where sign uses XOR logic, exponent uses integer addition, and mantissa multiplication is replaced by compact LUT access. NVFP4 activations use FP4 data with an FP8 block scale and an FP32 tensor scale. Across six edge-efficient models, block-size ablation shows that B = 16 offers a practical accuracy/storage trade-off, requiring only 4.5078 bits per input for N = 4096. Weight-precision ablation shows that FP8 and FP16 weights provide only modest gains over FP4 weights under the same NVFP4 activation path. Compared with pure unscaled FP4, NVFP4 without retraining recovers substantial accuracy by restoring activation dynamic range, while NVFP4 with retraining achieves the best accuracy across models. Hardware analysis shows that NVLUT achieves up to 26.85x energy reduction over traditional LUTs with ECC plus voltage scaling and up to 22.85x under mixed-voltage operation. Area is reduced by up to 2.21x and 1.52x, respectively. These results demonstrate that NVFP4 two-level scaling with selective reliability protection enables robust, low-energy edge inference.
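The sign/exponent/mantissa split described in the abstract can be illustrated with a short Python sketch. It assumes the standard E2M1 layout for FP4 (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1). The 4×4 table layout, the function names, and the way subnormals are folded into the table index are illustrative assumptions, not the paper's hardware design.

```python
def decode_e2m1(code):
    """Reference decoder for a 4-bit E2M1 code (sign | exp[1:0] | mantissa)."""
    sign = (code >> 3) & 1
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        mag = 0.5 * man                          # subnormal: 0 or 0.5
    else:
        mag = (1 + 0.5 * man) * 2.0 ** (exp - 1)
    return -mag if sign else mag


def split_fields(code):
    """Split a code into sign, effective exponent, and 2-bit significand."""
    sign = (code >> 3) & 1
    exp = (code >> 1) & 0b11
    man = code & 1
    hidden = 1 if exp else 0                     # implicit leading bit
    sig = 2 * hidden + man                       # in halves: 0, 0.5, 1, 1.5
    eff_exp = max(exp, 1) - 1                    # subnormals share exponent 0
    return sign, eff_exp, sig


# Precomputed significand products, stored in units of 1/4.
MANTISSA_LUT = [[a * b for b in range(4)] for a in range(4)]


def lut_multiply(code_a, code_b):
    """Multiply two E2M1 codes without a mantissa multiplier."""
    sa, ea, ma = split_fields(code_a)
    sb, eb, mb = split_fields(code_b)
    sign = sa ^ sb                               # sign path: XOR
    exp = ea + eb                                # exponent path: integer add
    prod = MANTISSA_LUT[ma][mb]                  # mantissa path: table lookup
    mag = prod * 2.0 ** (exp - 2)                # undo the 1/4 table units
    return -mag if sign else mag


print(lut_multiply(0b0011, 0b1101))              # 1.5 * -3.0 -> -4.5
```

Because there are only 16 codes, comparing `lut_multiply(a, b)` with `decode_e2m1(a) * decode_e2m1(b)` over all 256 pairs is a cheap way to confirm the table is exact. In a two-level scheme the block and tensor scales would be applied to the accumulated result separately.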

Problem

Research questions and friction points this paper is trying to address.

edge inference

energy efficiency

low-precision quantization

block size

numerical precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

NVFP4

LUT-based inference

two-level scaling