AI Summary
Traditional layer freezing still requires forward propagation through frozen layers, limiting computational efficiency gains; while feature map caching holds promise, it faces two overlooked challenges: augmentation invalidation and substantial memory overhead. This paper proposes the first systematic framework to address these issues: (1) a similarity-aware channel-level data augmentation strategy that mitigates distribution shift in cached features, and (2) a lossy progressive compression scheme that significantly reduces storage cost without compromising accuracy. Experiments across diverse models (ResNet, ViT) and benchmarks (ImageNet, CIFAR) demonstrate that the approach reduces training FLOPs by 32–47%, decreases GPU memory consumption by 58–73%, and incurs only marginal accuracy degradation (0.1–0.3%). These results validate the method's efficiency, robustness, and scalability.
Abstract
With the growing size of deep neural networks and datasets, the computational cost of training has increased significantly. Layer freezing has recently attracted great attention as a promising method to reduce the cost of network training. However, in traditional layer-freezing methods, forward propagation through frozen layers is still required to generate feature maps for the unfrozen layers, limiting the achievable savings. To overcome this, prior works proposed caching the feature maps produced by frozen layers as a new dataset, allowing later layers to train directly on the stored feature maps. While this approach appears straightforward, it presents several major challenges that prior literature has largely overlooked, such as how to effectively apply augmentations to cached feature maps and the substantial storage overhead the cache introduces. Left unaddressed, these challenges severely degrade the performance of the caching method and can even render it infeasible. This paper is the first to comprehensively explore these challenges and provide a systematic solution. To preserve training accuracy, we propose *similarity-aware channel augmentation*, which caches the channels with high augmentation sensitivity at minimal additional storage cost. To mitigate storage overhead, we incorporate lossy data compression into layer freezing and design a *progressive compression* strategy that increases the compression rate as more layers are frozen, effectively reducing storage costs. Our solution achieves significant reductions in training cost while maintaining model accuracy, with only minor time overhead. Additionally, we conduct a comprehensive evaluation of freezing and compression strategies, providing insights into optimizing their application for efficient DNN training.
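The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how augmentation sensitivity or the compression schedule are computed, so the per-channel cosine-similarity score, the storage-budget channel selection, and the linear bit-width schedule below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def channel_aug_sensitivity(feat_clean, feat_aug):
    """Per-channel cosine similarity between clean and augmented
    feature maps of shape [C, H, W]. Low similarity means the channel
    is highly sensitive to augmentation (assumed scoring rule)."""
    c = feat_clean.shape[0]
    f1 = feat_clean.reshape(c, -1)
    f2 = feat_aug.reshape(c, -1)
    num = (f1 * f2).sum(axis=1)
    den = np.linalg.norm(f1, axis=1) * np.linalg.norm(f2, axis=1) + 1e-8
    return num / den  # shape [C]


def select_sensitive_channels(feat_clean, feat_aug, budget=0.25):
    """Pick the most augmentation-sensitive channels to cache in
    augmented form, capped by a storage budget (fraction of channels)."""
    sim = channel_aug_sensitivity(feat_clean, feat_aug)
    k = max(1, int(budget * sim.size))
    return np.argsort(sim)[:k]  # lowest similarity = most sensitive


def progressive_bits(num_frozen, total_layers, max_bits=8, min_bits=2):
    """Progressive compression: the quantization bit-width shrinks
    (i.e., the compression rate grows) as more layers are frozen.
    A linear schedule is assumed here for illustration."""
    frac = num_frozen / total_layers
    return int(round(max_bits - frac * (max_bits - min_bits)))


def quantize(feat, bits):
    """Uniform lossy quantization of a cached feature map to `bits` bits,
    returning the codes plus the parameters needed to dequantize."""
    lo, hi = feat.min(), feat.max()
    levels = (1 << bits) - 1
    codes = np.round((feat - lo) / (hi - lo + 1e-8) * levels)
    return codes.astype(np.uint8), (lo, hi, levels)
```

For example, a run with 12 total layers would cache full-precision (8-bit) features while nothing is frozen and drop to 2-bit codes once all layers are frozen, while only the channels whose activations change most under augmentation are stored in augmented form.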