🤖 AI Summary
Standard CNNs can lose substantial accuracy under a single-pixel translation because their fully connected layers depend on spatial position, so they are not truly translation-invariant. This work inserts global average pooling (GAP) layers at several network depths to decouple feature recognition from spatial location, yielding a lightweight “Online Architecture.” The study shows that GAP substantially improves translation robustness while sharply reducing model size, and identifies the periodic aliasing introduced by discrete pooling as a fundamental limit on pixel-level invariance. Applied to VGG-16, the method cuts trainable parameters by 98% (from 5.2M to 82K) and total network size by 90% (from 138M to 14M), while reaching 66.4% ImageNet Top-1 accuracy and reducing the average relative loss under translation from 0.09 to 0.05. Integrated into the LPIPS framework for image quality assessment, the invariant backbones attain Spearman correlations of 0.89 on KADID-10k and 0.95 on RAID.
📝 Abstract
Convolutional Neural Networks (CNNs) are widely assumed to be translation-invariant, yet standard architectures exhibit a startling fragility: even a single-pixel shift can drastically degrade performance due to their reliance on spatially dependent fully connected layers. In this work, we address this vulnerability by proposing a lightweight 'Online Architecture' strategy. By inserting Global Average Pooling (GAP) layers at various network depths, we decouple feature recognition from spatial location. Using VGG-16 as a primary case study, we demonstrate that this architectural modification achieves a 98% reduction in trainable parameters (from 5.2M to just 82K) and a 90% reduction in total network size (from 138M to 14M). Despite this drastic pruning, our variants maintain competitive Top-1 accuracy on ImageNet (66.4%) while improving translational robustness, reducing the average relative loss from 0.09 to 0.05. Furthermore, our analysis identifies a fundamental limit to invariance: while GAP resolves macroscopic sensitivity, discrete pooling operations introduce a residual periodic aliasing that prevents perfect pixel-level stability. Finally, we extend these findings to Perceptual Image Quality Assessment (IQA) by integrating our invariant backbones into the LPIPS framework. The resulting metric significantly outperforms the retrained baseline in generalization on the KADID-10k dataset (Spearman 0.89 vs. 0.75) and achieves a near-perfect alignment with human psychophysical response curves on the RAID dataset (Spearman 0.95). These results suggest that enforcing architectural invariance is a more efficient and biologically plausible path to robustness than traditional data augmentation. The data and code are publicly available to facilitate validation and further research.
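The two mechanisms named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code: the feature-map sizes, the random weights, the helper names (`gap_head`, `fc_head`, `maxpool2_then_gap`), and the use of circular shifts via `np.roll` are all assumptions chosen for illustration. It compares a flatten-plus-fully-connected head with a GAP head under a one-pixel shift, then shows that a stride-2 max pool placed before GAP gives a response that repeats with the pooling stride rather than staying constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def gap_head(fmap, w):
    """Global average pool over H and W, then a linear layer. fmap: (C, H, W), w: (C, K)."""
    return fmap.mean(axis=(1, 2)) @ w

def fc_head(fmap, w):
    """Flatten every spatial position, then a linear layer. w: (C*H*W, K)."""
    return fmap.reshape(-1) @ w

def maxpool2_then_gap(x):
    """2x2 max pool with stride 2, then a global mean. x: (H, W) with even H and W."""
    h, w = x.shape
    pooled = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    return pooled.mean()

# Hypothetical sizes, not taken from the paper.
C, H, W, K = 8, 7, 7, 10
fmap = rng.standard_normal((C, H, W))
shifted = np.roll(fmap, shift=1, axis=2)  # circular one-pixel shift along width

w_gap = rng.standard_normal((C, K))
w_fc = rng.standard_normal((C * H * W, K))

# Largest change in any output unit caused by the shift.
gap_diff = np.abs(gap_head(fmap, w_gap) - gap_head(shifted, w_gap)).max()
fc_diff = np.abs(fc_head(fmap, w_fc) - fc_head(shifted, w_fc)).max()

# Head sizes: the GAP head needs H*W times fewer weights.
fc_params, gap_params = w_fc.size, w_gap.size

# Aliasing: the response after stride-2 pooling depends on shift parity.
img = rng.standard_normal((8, 8))
responses = [maxpool2_then_gap(np.roll(img, s, axis=1)) for s in range(4)]

print(f"GAP head change: {gap_diff:.2e}, FC head change: {fc_diff:.2e}")
print(f"FC head weights: {fc_params}, GAP head weights: {gap_params}")
print("pool+GAP response for shifts 0..3:", np.round(responses, 4))
```

Under these assumptions the GAP head changes only at floating-point precision, the fully connected head changes visibly, and the pooled response for shifts 0 and 2 matches while shifts 0 and 1 differ. Real images are shifted with cropping or padding rather than circularly, so border effects add to the behavior shown here.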