🤖 AI Summary
This work addresses the limitations of deep convolutional networks, which suffer from gradient degradation as depth increases and rely on fixed residual connections that cannot adapt to varying inputs or tasks. The authors propose GradAttn, a novel framework that, for the first time, integrates attention mechanisms into gradient pathway control. GradAttn replaces static residual connections with learnable, dynamic paths, fusing multi-scale CNN features through self-attention to combine shallow textures and deep semantics, while positional encoding modulates gradient flow. Challenging the conventional assumption that more stable gradients always improve performance, the method achieves superior results on five out of eight benchmark datasets compared to ResNet-18, with an accuracy gain of up to 11.07% on FashionMNIST at comparable model size, demonstrating that controlled gradient instability can enhance generalization.
📝 Abstract
Deep ConvNets suffer from gradient signal degradation as network depth increases, limiting effective feature learning in complex architectures. ResNet addressed this through residual connections, but these fixed shortcuts cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies. This study introduces GradAttn, a hybrid CNN-transformer framework that replaces fixed residual connections with attention-controlled gradient flow. By extracting multi-scale CNN features at different depths and regulating them through self-attention, GradAttn dynamically weights shallow texture features against deep semantic representations. To analyze the learned representations, we evaluate three GradAttn variants across eight diverse datasets spanning natural images, medical imaging, and fashion recognition. Results demonstrate that GradAttn outperforms ResNet-18 on five of eight datasets, achieving up to +11.07% accuracy improvement on FashionMNIST while maintaining comparable network size. Gradient flow analysis reveals that the controlled instabilities introduced by attention often coincide with improved generalization, challenging the assumption that perfectly stable gradients are optimal. Furthermore, the effectiveness of positional encoding proves dataset-dependent, with CNN hierarchies frequently encoding sufficient spatial structure on their own. These findings establish attention mechanisms as enablers of learnable gradient control, offering a new paradigm for adaptive representation learning in deep neural architectures.
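The core mechanism described above, fusing multi-scale CNN features through self-attention so that shallow and deep stages are weighted dynamically, can be illustrated with a minimal sketch. The abstract does not specify the architecture's internals, so the function name `attention_fusion`, the use of globally pooled per-stage descriptors, and the single-head formulation are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(features, Wq, Wk, Wv):
    """Fuse multi-scale features via single-head self-attention (illustrative).

    features: (S, d) array with one d-dim descriptor per CNN stage,
              e.g. globally pooled shallow, middle, and deep feature maps.
    Wq, Wk, Wv: (d, d) learnable projection matrices.
    Returns the fused (S, d) features and the (S, S) stage-to-stage
    attention map that plays the role of a learnable, input-dependent
    weighting in place of a fixed residual shortcut.
    """
    Q, K, V = features @ Wq, features @ Wk, features @ Wv
    d = features.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # rows sum to 1
    return A @ V, A

rng = np.random.default_rng(0)
d, S = 16, 3  # feature dim, number of CNN stages (hypothetical sizes)
feats = rng.normal(size=(S, d))
Wq, Wk, Wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
fused, A = attention_fusion(feats, Wq, Wk, Wv)
print(fused.shape, A.shape)  # → (3, 16) (3, 3)
```

Because the attention map `A` depends on the input features, the mixing of shallow and deep signals, and hence the gradient path through each stage, varies per example, which is the contrast the abstract draws against a static identity shortcut.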