SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and Progressive Transfer Learning

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of large-scale, high-quality image-text datasets hindering semantic understanding of Synthetic Aperture Radar (SAR) imagery, this work introduces SAR-Text, a SAR image-text dataset comprising over 130,000 pairs. The authors further propose SAR-Narrator, a generative framework that uses multi-stage progressive transfer learning to integrate SAR-specific domain priors with vision-language foundation models, enabling high-fidelity caption generation. The framework is designed for extensibility, facilitating community-driven dataset expansion. Experiments demonstrate substantial improvements: average recall in cross-modal retrieval increases by 16.43%; captioning achieves BLEU-4, SPICE, and CIDEr scores more than 8x, 4x, and 10x higher than the baseline, respectively; and on SAR visual question answering (SAR-VQA), the approach exhibits markedly stronger semantic comprehension and reasoning.

📝 Abstract
Vision-Language Models (VLMs) have achieved remarkable breakthroughs in remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-Text, a large-scale, high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage progressive transfer learning strategy. To verify the effectiveness of SAR-Text, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we build three representative models on SAR-Text: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 16.43% and 10.54% on the OSdataset-512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves BLEU-4, SPICE, and CIDEr scores exceeding those of the original CoCa model by more than 8x, 4x, and 10x, respectively. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets.
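The "average recall" reported for SAR-RS-CLIP is conventionally the mean of Recall@1/5/10 over an image-text similarity matrix produced by a CLIP-style dual encoder. A minimal sketch of the image-to-text direction, assuming one matched caption per image on the diagonal (the function name and toy data here are illustrative, not taken from the paper):

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Image-to-text Recall@K from a similarity matrix.

    sim[i, j] is the similarity between image i and caption j;
    the ground-truth pair sits on the diagonal (i matches i).
    """
    n = sim.shape[0]
    diag = sim[np.arange(n), np.arange(n)]
    # Rank of the true caption for each image (0 = top-ranked):
    # count how many captions score strictly higher than the match.
    ranks = (sim > diag[:, None]).sum(axis=1)
    return {k: float((ranks < k).mean()) for k in ks}

# Toy 4x4 similarity matrix where matched pairs score highest.
rng = np.random.default_rng(0)
sim = np.eye(4) + 0.1 * rng.random((4, 4))
scores = recall_at_k(sim)
mean_recall = sum(scores.values()) / len(scores)
```

In practice the metric is averaged over both retrieval directions (image-to-text and text-to-image, i.e. rerun on `sim.T`), which is the figure a 16.43% gain would refer to.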
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale SAR image-text datasets for semantic understanding
Need for automated SAR image captioning and retrieval solutions
Improving SAR vision-language model performance on remote sensing tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAR-Narrator automatically generates textual descriptions for SAR images
Multi-stage progressive transfer learning improves caption quality
Three adapted VLMs (SAR-RS-CLIP, SAR-RS-CoCa, SAR-GPT) improve SAR retrieval, captioning, and VQA
Xinjun Cheng
Intelligent Game and Decision Lab, Beijing, China
Yiguo He
Intelligent Game and Decision Lab, Beijing, China
Junjie Zhu
Shanghai Jiao Tong University
Chunping Qiu
Intelligent Game and Decision Lab, Beijing, China
Jun Wang
Intelligent Game and Decision Lab, Beijing, China
Qiangjuan Huang
Intelligent Game and Decision Lab, Beijing, China
Ke Yang
Intelligent Game and Decision Lab, Beijing, China