🤖 AI Summary
This work addresses the computational redundancy and reliance on spurious background correlations that arise when modern vision models densely process spatial feature maps, both of which compromise robustness and interpretability. The authors propose a differentiable energy minimization framework that casts feature selection as an energy competition, combining unary importance costs with pairwise spatial coherence penalties. A lightweight energy-based masking layer is integrated into the convolutional backbone to enable input-adaptive dynamic sparsification, without requiring a fixed sparsity budget, heuristic scoring functions, or pixel-level supervision. The method preserves classification accuracy while improving robustness to structured occlusion and producing semantically focused, interpretable masks. In deletion tests, the learned energy ranking significantly outperforms magnitude-based pruning.
📝 Abstract
Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
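The abstract describes each spatial location's energy as a unary importance cost plus a pairwise spatial coherence penalty, but it does not give the functional form. The NumPy sketch below is one illustrative reading under stated assumptions, not the authors' Energy-Mask Layer. The names `pairwise_penalty` and `energy_mask`, the squared-difference penalty over 4-neighbours, the weight `lam`, the temperature `tau`, and the fixed-point iteration are all hypothetical choices made for this example. The unary cost is taken as a given array, where in a real network it would presumably come from a learned projection of the feature map.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def pairwise_penalty(mask):
    """Assumed coherence term: sum of squared differences between each
    location's mask value and its 4 neighbours (edge-replicated border)."""
    p = np.pad(mask, 1, mode="edge")
    center = p[1:-1, 1:-1]
    return (
        (center - p[:-2, 1:-1]) ** 2    # neighbour above
        + (center - p[2:, 1:-1]) ** 2   # neighbour below
        + (center - p[1:-1, :-2]) ** 2  # neighbour to the left
        + (center - p[1:-1, 2:]) ** 2   # neighbour to the right
    )


def energy_mask(unary, lam=1.0, tau=1.0, n_iters=5):
    """Return a soft mask in (0, 1) and the per-location energy.

    unary : (H, W) array of importance costs; lower cost means "keep".
    lam   : weight of the pairwise term (hypothetical hyperparameter).
    tau   : temperature of the sigmoid (hypothetical hyperparameter).
    """
    energy = unary.astype(float)
    mask = sigmoid(-energy / tau)  # low energy -> mask close to 1
    for _ in range(n_iters):
        # Energy = unary cost + weighted spatial coherence penalty.
        energy = unary + lam * pairwise_penalty(mask)
        mask = sigmoid(-energy / tau)
    return mask, energy
```

In this sketch, no sparsity budget is fixed in advance: how many locations end up with a mask near 1 depends on the unary costs of the particular input. A feature map of shape `(C, H, W)` could then be gated with `features * mask[None, :, :]`. Every operation is differentiable, so the same structure could be expressed in an autodiff framework.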