🤖 AI Summary
Existing model compression methods struggle to deliver consistent inference performance across homogeneous edge devices (devices of the same model), because hardware-level differences arise from manufacturing variation, user configuration, environmental conditions, and device aging. To address this, we propose Homogeneous-Device Aware Pruning (HDAP), the first hardware-aware compression framework that explicitly models performance drift among same-model devices. HDAP comprises three key components: (i) hardware-response-based device clustering, (ii) surrogate-model-driven latency prediction, and (iii) structured pruning jointly optimized for FLOPs and latency. Evaluated on ResNet50 and MobileNetV1 with ImageNet, HDAP achieves lower average inference latency than state-of-the-art methods (for example, a 2.86× speedup for ResNet50 at 1.0G FLOPs) and scales to large fleets of same-model devices.
📝 Abstract
Deploying deep neural networks (DNNs) across homogeneous edge devices (devices with the same SKU, as labeled by the manufacturer) often assumes identical performance among them. However, once a device model is widely deployed, the performance of individual devices diverges over time because of differences in user configurations, environmental conditions, manufacturing variances, battery degradation, etc. Existing DNN compression methods do not consider this scenario and cannot guarantee good compression results on all homogeneous edge devices. To address this, we propose Homogeneous-Device Aware Pruning (HDAP), a hardware-aware DNN compression framework explicitly designed for homogeneous edge devices, which aims to achieve optimal average performance of the compressed model across all devices. Because hardware-aware evaluation on thousands or millions of homogeneous edge devices is prohibitively time-consuming, HDAP partitions the devices into several device clusters, which dramatically reduces the number of devices to evaluate, and uses surrogate-based evaluation instead of real-time hardware evaluation. Experiments on ResNet50 and MobileNetV1 with the ImageNet dataset show that HDAP consistently achieves lower average inference latency than state-of-the-art methods, with substantial speedups (e.g., 2.86$\times$ at 1.0G FLOPs for ResNet50) on the homogeneous device clusters. HDAP offers an effective solution for scalable, high-performance DNN deployment on homogeneous edge devices.
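
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the probe-latency vectors, the plain k-means with farthest-point initialization, the nearest-to-centroid representative, and the names `kmeans`, `estimate_fleet_latency`, and `measure` are all assumptions made for illustration. Here `measure` stands in for whatever evaluates a candidate pruned model on one device; in HDAP a surrogate model would take that role instead of a real hardware run.

```python
from statistics import mean

def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization.

    Returns (labels, centers). This is a generic stand-in for whatever
    clustering procedure the paper actually uses.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels, centers

def estimate_fleet_latency(points, labels, centers, measure):
    """Estimate fleet-average latency by evaluating one device per cluster.

    Each cluster's representative (the member nearest its centroid) is
    evaluated once and weighted by the cluster size.
    """
    total = 0.0
    for c, center in enumerate(centers):
        members = [i for i, l in enumerate(labels) if l == c]
        if not members:
            continue
        rep = min(members, key=lambda i: dist2(points[i], center))
        total += len(members) * measure(rep)
    return total / len(points)

# Hypothetical data: per-device latency (ms) on two probe workloads.
responses = [
    [10.0, 20.0], [10.2, 20.1], [9.8, 19.9],   # devices running near nominal speed
    [14.0, 28.0], [14.2, 28.3], [13.8, 27.7],  # devices that have slowed down
]
# Hypothetical latency (ms) of one candidate pruned model on each device.
candidate_latency_ms = [5.0, 5.1, 4.9, 7.0, 7.1, 6.9]

calls = []  # records which devices were actually evaluated

def measure(device_index):
    calls.append(device_index)
    return candidate_latency_ms[device_index]

labels, centers = kmeans(responses, k=2)
estimate = estimate_fleet_latency(responses, labels, centers, measure)
print(f"estimated fleet latency: {estimate:.2f} ms using {len(calls)} evaluations")
```

In this toy fleet the estimate is built from two evaluations rather than six, which is the saving the clustering step is meant to provide; how well the estimate tracks the true fleet average depends on how tight the clusters are.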