🤖 AI Summary
This study investigates whether foundation models in remote sensing require the large-scale parameterization typical of general-purpose vision models. Challenging the prevailing paradigm of scaling up model size, the work systematically evaluates representational redundancy and tests the hypothesis that remote sensing models enter an over-parameterized regime at significantly smaller scales than their computer vision counterparts. Through post-hoc uniform width pruning, learned slimmable training, feature correlation analysis, and explained variance ratio measurements, the authors show, for the first time, that remote sensing models exhibit highly redundant internal representations. Experiments demonstrate that these models retain over 71% relative accuracy at merely 1% of the original FLOPs, with performance degrading seven times more slowly than that of a masked autoencoder (MAE) trained on ImageNet. The study introduces post-hoc slimmability as both a diagnostic tool and a strategy for efficient deployment, advocating a shift away from indiscriminate model scaling in remote sensing.
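The post-hoc uniform width pruning used as the core measurement tool can be sketched on a toy two-layer MLP. This is a minimal illustration, not the paper's implementation: the function names, the keep-the-first-channels heuristic, and the layer sizes are all assumptions for demonstration.

```python
import numpy as np

def slim_mlp(W1, b1, W2, keep_frac):
    """Post-hoc uniform width slimming of a 2-layer MLP.

    Keeps the first `keep_frac` fraction of hidden units by truncating
    the corresponding rows/columns of the weight matrices. (Illustrative
    heuristic; real slimming may rank channels by importance.)
    """
    k = max(1, int(W1.shape[0] * keep_frac))  # number of hidden units kept
    return W1[:k], b1[:k], W2[:, :k]

def forward(x, W1, b1, W2):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 64, 4
W1 = rng.normal(size=(d_hid, d_in))
b1 = rng.normal(size=d_hid)
W2 = rng.normal(size=(d_out, d_hid))
x = rng.normal(size=d_in)

# Slim to 25% width: hidden-layer FLOPs drop roughly 4x, no retraining.
W1s, b1s, W2s = slim_mlp(W1, b1, W2, keep_frac=0.25)
y = forward(x, W1s, b1s, W2s)
```

Sweeping `keep_frac` and measuring downstream accuracy at each width is what yields the redundancy curves described above.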
📝 Abstract
Large-scale foundation models (FMs) in remote sensing (RS) are developed following the paradigms established in computer vision (CV) and have shown promise for various Earth observation applications. However, the direct transfer of scaling assumptions from CV to RS has not been adequately examined. We hypothesize that RS FMs enter an over-parameterized regime at substantially smaller scales than their CV counterparts, where increasing the parameter count primarily induces redundant representations rather than qualitatively new abstractions. To test this hypothesis, we use post-hoc slimming, in which we uniformly reduce the width of a pretrained encoder, as a tool to measure representational redundancy across six state-of-the-art RS FMs on four downstream classification tasks. Our findings reveal a striking contrast with the CV domain: while a post-hoc slimmed masked autoencoder (MAE) trained on ImageNet retains less than 10% accuracy at 1% of FLOPs, RS FMs maintain over 71% relative accuracy at the same budget. This sevenfold difference provides strong empirical support for our hypothesis. We further demonstrate that learned slimmable training can improve both Momentum Contrast (MoCo)-based and MAE-based models. In addition, through explained variance ratio and feature correlation analyses, we provide mechanistic explanations showing that RS FMs distribute task-relevant information with high redundancy. Our findings establish post-hoc slimmability as both a practical deployment strategy for resource-constrained environments and a diagnostic tool that challenges the prevailing scaling paradigm in RS. Upon acceptance, we will publish all code.
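As a rough illustration of the explained-variance-ratio diagnostic mentioned in the abstract, the sketch below computes the PCA explained variance ratio of a feature matrix and counts how many components are needed to reach a variance threshold. The synthetic features, the 99% threshold, and the dimensions are assumptions for demonstration only; a low component count relative to the feature dimension indicates redundant representations.

```python
import numpy as np

def explained_variance_ratio(feats):
    """PCA explained-variance ratio of a feature matrix (n_samples, n_dims)."""
    X = feats - feats.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the per-component variances.
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Toy features: 256-dim embeddings whose variance lives in only
# 4 intrinsic directions plus a little noise (a redundancy stand-in).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
mix = rng.normal(size=(4, 256))
feats = latent @ mix + 0.01 * rng.normal(size=(500, 256))

evr = explained_variance_ratio(feats)
# Number of components needed to explain 99% of the variance.
k = int(np.searchsorted(np.cumsum(evr), 0.99)) + 1
```

Here `k` comes out far smaller than the 256 feature dimensions, the kind of signature that, per the abstract, distinguishes RS FMs from their CV counterparts.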