Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address pervasive redundancy and long-tailed class distributions in remote sensing imagery, this paper proposes a dynamic dataset pruning framework for self-supervised learning (SSL) on SAR data that requires no pre-trained feature extractor. The method performs online diversity assessment and iterative sample reweighting to dynamically optimize the composition of the full decade-long Sentinel-1 WV archive, promoting intra-class and inter-class balance while improving representation robustness. The authors train SSL models from scratch on this archive and release Nereus-SAR-1, the first model in the Nereus family of foundation models for ocean observation with SAR imagery. Evaluated on three downstream tasks, Nereus-SAR-1 improves linear probe accuracy by an average of 5.2% and improves training efficiency by 37%. All model weights are publicly released.

📝 Abstract
Self-supervised learning (SSL) has enabled the development of vision foundation models for Earth Observation (EO), demonstrating strong transferability across diverse remote sensing tasks. While prior work has focused on network architectures and training strategies, the role of dataset curation, especially in balancing and diversifying pre-training datasets, remains underexplored. In EO, this challenge is amplified by the redundancy and heavy-tailed distributions common in satellite imagery, which can lead to biased representations and inefficient training. In this work, we propose a dynamic dataset pruning strategy designed to improve SSL pre-training by maximizing dataset diversity and balance. Our method iteratively refines the training set without requiring a pre-existing feature extractor, making it well-suited for domains where curated datasets are limited or unavailable. We demonstrate our approach on the Sentinel-1 Wave Mode (WV) Synthetic Aperture Radar (SAR) archive, a challenging dataset dominated by ocean observations. We train models from scratch on the entire Sentinel-1 WV archive spanning 10 years. Across three downstream tasks, our results show that dynamic pruning improves both computational efficiency and representation quality, leading to stronger transferability. We also release the weights of Nereus-SAR-1, the first model in the Nereus family, a series of foundation models for ocean observation and analysis using SAR imagery, at github.com/galeio-research/nereus-sar-models/.
Problem

Research questions and friction points this paper is trying to address.

Addressing dataset redundancy in EO SSL pre-training
Improving diversity and balance in satellite imagery datasets
Enhancing computational efficiency and representation quality in SAR models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic dataset pruning for diversity and balance
Iterative refinement without pre-existing feature extractor
Improved computational efficiency and representation quality
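The paper itself does not publish pseudocode on this page, but the pruning idea described above can be sketched in a few lines. The sketch below is a minimal, hypothetical illustration: it assumes a k-nearest-neighbour diversity score computed over embeddings from the encoder currently being trained (no pre-existing feature extractor), and keeps only the most diverse fraction of samples at each refinement step. Function names (`diversity_scores`, `prune_dataset`) and the scoring rule are assumptions for illustration, not the authors' exact method.

```python
import numpy as np

def diversity_scores(embeddings, k=5):
    # Score each sample by its mean distance to its k nearest
    # neighbours in embedding space: rare, isolated samples score
    # high, redundant near-duplicates score low.
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # exclude self-distance
    knn = np.sort(d, axis=1)[:, :k]       # k smallest distances per sample
    return knn.mean(axis=1)

def prune_dataset(indices, embeddings, keep_frac=0.7, k=5):
    # Keep the most diverse fraction of the current training set;
    # called periodically during SSL pre-training as the encoder
    # (and hence the embeddings) evolves.
    scores = diversity_scores(embeddings, k=k)
    n_keep = max(1, int(len(indices) * keep_frac))
    order = np.argsort(-scores)           # highest diversity first
    return [indices[i] for i in order[:n_keep]]

# Toy usage: a redundant cluster of near-duplicates plus a few
# rare samples, mimicking ocean-dominated SAR archives.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.05, size=(20, 8))   # near-duplicates
outliers = rng.normal(3.0, 1.0, size=(5, 8))    # rare samples
emb = np.vstack([cluster, outliers])
kept = prune_dataset(list(range(25)), emb, keep_frac=0.3)
# The rare samples (indices 20-24) dominate the kept set.
```

Because the scores are recomputed from the in-training encoder, the kept subset co-evolves with the representation, which is what distinguishes dynamic pruning from one-shot dataset filtering.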
Thomas Kerdreux
Galeio, Paris, France
Alexandre Tuel
Galeio
Remote sensing, artificial intelligence, climate change
Q. Febvre
Ifremer, UMR CNRS LOPS, Brest, France
A. Mouche
Ifremer, UMR CNRS LOPS, Brest, France
Bertrand Chapron
Ifremer, UMR CNRS LOPS, Brest, France