🤖 AI Summary
This paper examines the potential and limitations of foundation models for representation learning in remote sensing, where most approaches are adapted from computer vision with minimal domain-specific modification. It surveys the competing approaches alongside their computer vision roots, characterizing their advantages and pitfalls with respect to representation quality, transferability, and computational cost. Particular emphasis is placed on the multi-sensor nature of Earth observations and on how far existing pretraining strategies actually exploit multiple sensors, viewed in relation to multi-modal foundation models more broadly. The contributions include: (1) a structured analysis of current remote sensing foundation model approaches and their trade-offs; (2) a discussion of the quality of learned representations and of methods to reduce the massive compute demands of pretraining; and (3) an outline of future directions for harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations. Collectively, the paper charts a practical pathway toward efficient, generalizable, remote sensing-specific foundation models.
📝 Abstract
Foundation models have garnered increasing attention for representation learning in remote sensing, primarily adopting approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches that each come with significant benefits and drawbacks. This paper examines these approaches along with their roots in the computer vision field in order to characterize potential advantages and pitfalls while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We place emphasis on the multi-sensor aspect of Earth observations, and the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
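To make the multi-sensor pretraining idea referenced above more concrete, the sketch below pairs co-located patches from two sensors under a contrastive objective on unlabeled data. This is a minimal, hypothetical illustration, not the method advocated in the paper: the encoder architecture, band counts, patch sizes, and the InfoNCE objective are all assumptions chosen for brevity.

```python
# Illustrative sketch only (assumed setup, not the paper's method):
# cross-sensor contrastive pretraining on co-registered, unlabeled patches,
# e.g. SAR (2 bands) and multispectral optical (12 bands).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Tiny CNN encoder mapping one sensor's patch to a normalized embedding."""
    def __init__(self, in_bands: int, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_bands, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(z_a, z_b, temperature: float = 0.07):
    """Symmetric InfoNCE: co-located patches from the two sensors are positives."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# One pretraining step on a batch of co-located patch pairs (random tensors
# stand in for real imagery here).
sar_enc, opt_enc = PatchEncoder(in_bands=2), PatchEncoder(in_bands=12)
optimizer = torch.optim.AdamW(
    list(sar_enc.parameters()) + list(opt_enc.parameters()), lr=1e-4)

sar_batch = torch.randn(16, 2, 64, 64)    # e.g., SAR VV/VH patches
opt_batch = torch.randn(16, 12, 64, 64)   # e.g., multispectral patches

loss = info_nce(sar_enc(sar_batch), opt_enc(opt_batch))
loss.backward()
optimizer.step()
```

The same pairing idea extends to seasonal revisits of the same location, which is one of the unlabeled supervision signals the abstract highlights.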