🤖 AI Summary
In integer AI inference, the rescaling operations performed in hardware incur substantial computational overhead due to high-precision multiplications, severely limiting energy efficiency on embedded systems. While quantization-aware training (QAT) mitigates the accuracy degradation caused by quantization, it neglects the computational cost of rescaling itself. This work proposes Rescale-Aware Training (RAT), a QAT framework that explicitly incorporates the rescaling multiplication factors into the quantization-aware optimization objective, enabling fine-grained adaptation to ultra-low-bit-width rescalers (e.g., 4-bit). RAT introduces negligible training overhead and requires no hardware modifications. Experiments demonstrate zero accuracy loss even when rescaling precision is reduced by 8× (e.g., from 32-bit to 4-bit), alongside significant reductions in inference latency and energy consumption. RAT thus makes AI deployment on edge devices both more energy-efficient and more cost-effective.
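For context, here is a minimal sketch of why the rescaler width matters: in integer-only inference the int32 accumulator is typically requantized by a fixed-point multiply and shift, and shrinking that multiplier from 32 bits to a few bits removes the expensive high-precision multiplication. The function names and the `mantissa_bits` parameter below are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch, not the paper's implementation.
import numpy as np

def quantize_rescaler(scale: float, mantissa_bits: int = 4):
    """Approximate a real-valued rescale factor as m / 2**shift,
    where m fits in `mantissa_bits` bits instead of a full 32-bit word."""
    if scale <= 0.0:
        raise ValueError("rescale factor must be positive")
    e = 0
    # Normalize the factor into [0.5, 1.0) so the mantissa uses its full range
    # (typical rescale factors in integer inference are well below 1).
    while scale < 0.5:
        scale *= 2.0
        e += 1
    while scale >= 1.0:
        scale /= 2.0
        e -= 1
    m = int(round(scale * (1 << mantissa_bits)))
    shift = e + mantissa_bits
    if m == (1 << mantissa_bits):  # rounding overflowed the mantissa; renormalize
        m >>= 1
        shift -= 1
    return m, shift                # rescale(x) ~= (x * m) >> shift

def rescale(acc: np.ndarray, m: int, shift: int) -> np.ndarray:
    """Requantize int32 accumulators with the narrow multiplier (arithmetic shift)."""
    return (acc.astype(np.int64) * m) >> shift

# Example: approximate the rescale factor 0.00372 with a 4-bit multiplier.
acc = np.array([1234, -5678, 40000], dtype=np.int32)
m, shift = quantize_rescaler(0.00372, mantissa_bits=4)
print(m, shift)               # e.g. 15, 12  ->  15/4096 ~= 0.00366
print(rescale(acc, m, shift))
```

The multiply-and-shift form is the standard requantization trick; the point of the paper is that the mantissa `m` can be made very narrow if the model is trained to tolerate the resulting rescaling error.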
📝 Abstract
Integer AI inference significantly reduces computational complexity in embedded systems. Quantization-aware training (QAT) helps mitigate the accuracy degradation associated with post-training quantization, but it still overlooks the impact of integer rescaling during inference, a hardware-costly operation in integer-only AI inference. This work shows that the rescaling cost can be dramatically reduced post-training by applying stronger quantization to the rescale multiplicands, with no loss in model quality. Furthermore, we introduce Rescale-Aware Training, a fine-tuning method for ultra-low bit-width rescaling multiplicands. Experiments show that even with 8x-reduced rescaler widths, full accuracy is preserved through minimal incremental retraining. This enables more energy-efficient and cost-effective AI inference for resource-constrained embedded systems.
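The abstract does not spell out the fine-tuning mechanics, so the following is only a hedged sketch of what rescale-aware fine-tuning could look like in PyTorch: the low-bit rescaler is folded into the forward pass with a straight-through estimator (STE) so the training loss sees the rescaling error. The function name, the STE choice, and the per-channel handling are assumptions, not the paper's code.

```python
# Illustrative sketch, not the paper's implementation.
import torch

def quantize_rescaler_ste(scale: torch.Tensor, mantissa_bits: int = 4) -> torch.Tensor:
    """Round per-channel rescale factors to a `mantissa_bits` mantissa times a
    power of two; the straight-through estimator keeps gradients flowing."""
    e = torch.ceil(torch.log2(scale))             # power-of-two exponent
    frac = scale / (2.0 ** e)                     # normalized into (0.5, 1.0]
    m = torch.round(frac * (1 << mantissa_bits))  # low-bit mantissa
    # If rounding hits 2**mantissa_bits, fold the extra bit into the exponent.
    overflow = m >= (1 << mantissa_bits)
    m = torch.where(overflow, m / 2, m)
    e = torch.where(overflow, e + 1, e)
    q = m * (2.0 ** e) / (1 << mantissa_bits)     # quantized rescale factor
    return scale + (q - scale).detach()           # forward: q, backward: identity

# During fine-tuning the quantized rescaler multiplies the accumulator exactly as
# the integer hardware would, so the loss reflects the low-bit rescaling error.
scales = torch.tensor([3.1e-3, 7.9e-4, 1.2e-2], requires_grad=True)
q = quantize_rescaler_ste(scales, mantissa_bits=4)
print(q)  # forward values use 4-bit rescalers; gradients still reach `scales`
```

Under these assumptions, the only change to a standard QAT loop is that the rescale factors pass through a fake-quantizer like the one above, which matches the abstract's claim of minimal incremental retraining and no hardware modifications.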