🤖 AI Summary
On resource-constrained devices, depthwise separable CNNs (e.g., MobileNetV2) are dominated by costly pointwise convolutions, yet existing quantization methods fail to address this non-uniform distribution of computational cost. Method: We propose a hierarchical quantization strategy: ternary weights (−1, 0, +1) for the pointwise convolutions, which are the computational bottleneck, and 8-bit quantization of weights and activations in all other layers. Crucially, combining ternary pointwise weights with 8-bit activations makes the pointwise layers purely int8-additive, eliminating all multiplications in these layers. Quantization-aware training with structural constraints ensures hardware efficiency and accuracy retention. Results: Evaluated on ImageNet, our method achieves a 23.9× reduction in energy consumption and a 2.7× reduction in model storage over a float16 baseline, with negligible top-1 accuracy degradation (<0.1%). This significantly advances the energy-efficiency–accuracy Pareto frontier for edge-deployable CNNs.
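To see why ternary pointwise weights remove multiplications entirely, note that a 1×1 convolution with weights in {−1, 0, +1} reduces, per output channel, to a signed sum of input channels: add the channels whose weight is +1, subtract those whose weight is −1, and skip the zeros. A minimal NumPy sketch of this idea (the function name and tensor shapes are illustrative, not from the paper):

```python
import numpy as np

def ternary_pointwise_conv(x_int8, w_ternary):
    """Multiplication-free 1x1 (pointwise) convolution.

    x_int8:    int8 activations, shape (C_in, H, W)
    w_ternary: weights in {-1, 0, +1}, shape (C_out, C_in)

    Each output channel is the sum of the input channels whose weight
    is +1 minus the sum of those whose weight is -1, so the whole layer
    runs on int8 additions accumulated in a wider int32 register.
    """
    c_out = w_ternary.shape[0]
    _, h, w = x_int8.shape
    x = x_int8.astype(np.int32)      # widen before accumulating
    out = np.zeros((c_out, h, w), dtype=np.int32)
    for o in range(c_out):
        pos = w_ternary[o] == 1      # channels to add
        neg = w_ternary[o] == -1     # channels to subtract
        out[o] = x[pos].sum(axis=0) - x[neg].sum(axis=0)
    return out
```

A production kernel would fuse this into a vectorized int8 routine; the sketch only demonstrates that the inner loop needs no multiply instructions.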
📝 Abstract
Convolutional neural networks (CNNs) are crucial for computer vision tasks on resource-constrained devices. Quantization effectively compresses these models, reducing storage size and energy cost. However, in modern depthwise-separable architectures, the computational cost is distributed unevenly across the network's components, with pointwise operations being the most expensive. Existing quantization approaches apply a general scheme to this imbalanced cost distribution and thus fail to fully exploit potential efficiency gains. To address this, we introduce PROM, a straightforward approach for quantizing modern depthwise-separable convolutional networks by selectively using two distinct bit-widths. Specifically, pointwise convolutions are quantized to ternary weights, while the remaining modules use 8-bit weights; this is achieved through a simple quantization-aware training procedure. Additionally, by quantizing activations to 8-bit, our method transforms pointwise convolutions with ternary weights into int8 additions, which enjoy broad support across hardware platforms and effectively eliminate the need for expensive multiplications. Applying PROM to MobileNetV2 reduces the model's energy cost by more than an order of magnitude (23.9x) and its storage size by 2.7x compared to the float16 baseline while retaining similar classification performance on ImageNet. Our method advances the Pareto frontier of energy consumption vs. top-1 accuracy for quantized convolutional models on ImageNet. PROM addresses the challenges of quantizing depthwise-separable convolutional networks to both ternary and 8-bit weights, offering a simple way to reduce energy cost and storage size.
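For intuition on the quantization-aware training step, ternary QAT is commonly implemented with a straight-through estimator (STE): weights are ternarized in the forward pass while gradients flow through to the latent float weights unchanged. A hedged PyTorch sketch along those lines follows; the 0.7·mean(|w|) threshold is a heuristic from the ternary-weight-network literature and the class names are hypothetical, so PROM's actual quantizer, scaling, and structural constraints may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryQuant(torch.autograd.Function):
    """Ternarize weights to {-1, 0, +1} in the forward pass and use a
    straight-through estimator (identity gradient) in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        # Threshold heuristic from ternary weight networks;
        # not necessarily the threshold PROM uses.
        delta = 0.7 * w.abs().mean()
        return torch.sign(w) * (w.abs() > delta).float()

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # STE: pass gradients through unchanged

class TernaryPointwise(nn.Module):
    """1x1 convolution whose weights are ternarized on the fly, so the
    forward pass during training matches the quantized deployment path."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(c_out, c_in, 1, 1))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, x):
        return F.conv2d(x, TernaryQuant.apply(self.weight))
```

During training the latent float weights are updated as usual; at export time only the ternary values (plus any per-layer scale) need to be stored, which is where the storage reduction comes from.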