🤖 AI Summary
This work addresses the limited interpretability of existing vision-based image geolocation models, which often fail to reveal the rationale behind their predictions. To overcome this limitation, the authors propose a multi-layer gradient-weighted class activation mapping (Grad-CAM) fusion strategy that moves beyond conventional approaches relying solely on the deepest convolutional features. By integrating activation signals from both intermediate and deep layers of a convolutional neural network, the method generates finer-grained and more comprehensive visual explanations of model decisions. This approach substantially enhances model transparency and trustworthiness, outperforming existing single-layer Grad-CAM techniques in interpretability and providing richer, more accurate justifications for geolocation predictions.
📝 Abstract
Planet-scale photo geolocalization is the challenging task of estimating the geographic location depicted in an image based purely on its visual content. While deep learning models, particularly convolutional neural networks (CNNs), have significantly advanced this field, understanding the reasoning behind their predictions remains difficult. In this paper, we present Combi-CAM, a novel method that enhances the explainability of CNN-based geolocalization models by combining gradient-weighted class activation maps obtained from several layers of the network architecture, rather than using only information from the deepest layer as is typically done. This approach provides a more detailed picture of how different image features contribute to the model's decisions, offering deeper insights than traditional single-layer approaches.
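To make the fusion idea concrete, the sketch below shows one plausible reading of multi-layer Grad-CAM combination in plain NumPy. The abstract does not specify Combi-CAM's exact fusion rule, so the per-layer normalisation, nearest-neighbour upsampling, and averaging here are assumptions for illustration; the function names (`grad_cam`, `upsample`, `combi_cam`) are hypothetical, not the authors' API.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Single-layer Grad-CAM: weight each channel by its mean gradient,
    sum over channels, and keep only positive evidence (ReLU)."""
    # activations, gradients: (C, H, W) arrays from one conv layer
    weights = gradients.mean(axis=(1, 2))             # (C,) channel importances
    cam = np.tensordot(weights, activations, axes=1)  # (H, W) weighted sum
    return np.maximum(cam, 0.0)                       # ReLU

def upsample(cam, size):
    """Nearest-neighbour upsampling to a common spatial size (a simple
    stand-in for the interpolation a real pipeline would use)."""
    h, w = cam.shape
    return np.kron(cam, np.ones((size // h, size // w)))

def combi_cam(layer_outputs, size=32):
    """Assumed fusion rule: normalise each layer's CAM to [0, 1],
    upsample to a shared resolution, and average across layers."""
    fused = np.zeros((size, size))
    for acts, grads in layer_outputs:
        cam = upsample(grad_cam(acts, grads), size)
        if cam.max() > 0:
            cam = cam / cam.max()                     # per-layer normalisation
        fused += cam
    return fused / len(layer_outputs)

# Toy example: activations/gradients from two layers at different resolutions
rng = np.random.default_rng(0)
layers = [
    (rng.random((8, 8, 8)), rng.random((8, 8, 8))),    # intermediate layer
    (rng.random((16, 4, 4)), rng.random((16, 4, 4))),  # deepest layer
]
heatmap = combi_cam(layers, size=32)
print(heatmap.shape)  # → (32, 32)
```

Averaging normalised maps lets the fine spatial detail of intermediate layers and the high-level semantics of the deepest layer both contribute, which is the intuition the paper's multi-layer fusion is built on.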