🤖 AI Summary
SAR imagery exhibits complex imaging geometry and strong sensitivity to acquisition parameters, which limits performance on downstream tasks such as height reconstruction and semantic segmentation. To address this, the authors propose SARFormer, a parameter-aware Vision Transformer. Its core contribution is an acquisition-parameter encoding module that explicitly injects key imaging parameters, such as incidence angle and range resolution, into the self-attention mechanism of the ViT backbone. SARFormer further combines self-supervised pre-training with fine-tuning on limited labeled data to improve generalization under label-scarce conditions, and ablation studies confirm the contribution of each component. Evaluated on real-world SAR datasets, SARFormer achieves up to a 17% improvement in RMSE for height reconstruction over baseline models while also improving semantic segmentation mIoU, and it remains robust even with very limited annotated data.
📝 Abstract
This manuscript introduces SARFormer, a modified Vision Transformer (ViT) architecture designed for processing one or more synthetic aperture radar (SAR) images. Given the complex imaging geometry of SAR data, we propose an acquisition parameter encoding module that guides the learning process, particularly when multiple images are used, leading to improved performance on downstream tasks. We further explore self-supervised pre-training, conduct experiments with limited labeled data, and thoroughly benchmark our contributions and adaptations in ablation experiments against a baseline on tasks such as height reconstruction and segmentation. Our approach achieves up to a 17% improvement in terms of RMSE over baseline models.
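The abstract does not spell out how acquisition parameters enter the attention computation, but the general idea can be sketched. Below is a minimal, hypothetical NumPy illustration: scalar acquisition parameters (e.g. an incidence angle) are embedded with a sinusoidal encoding, and a learned projection of that embedding is added as a bias to the self-attention logits. All function names, shapes, and the bias-injection choice are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def encode_params(params, dim=16):
    """Sinusoidal embedding of scalar acquisition parameters
    (e.g. incidence angle). Hypothetical stand-in for the
    paper's acquisition-parameter encoding module."""
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), dim // 2))
    angles = np.outer(params, 1.0 / freqs)              # (P, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def attention_with_param_bias(Q, K, V, param_emb, W_b):
    """Toy single-head self-attention where a learned projection
    of the parameter embedding biases the attention logits
    (one possible injection scheme; an assumption here)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                       # (N, N)
    logits = logits + float(param_emb @ W_b)            # scalar parameter bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # softmax over keys
    return w @ V

rng = np.random.default_rng(0)
N, d = 4, 8                                             # tokens, embedding dim
Q, K, V = rng.normal(size=(3, N, d))
p = encode_params(np.array([35.0]))[0]                  # e.g. 35° incidence angle
W_b = rng.normal(size=p.shape)                          # learned projection (random here)
out = attention_with_param_bias(Q, K, V, p, W_b)
print(out.shape)  # (4, 8)
```

A richer variant could project the parameter embedding into per-token key/value offsets instead of a scalar bias; the abstract leaves this design choice open.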