SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding

📅 2025-04-04

🤖 AI Summary
SAR imagery exhibits distinct imaging mechanisms and low visual alignment with human perception, limiting existing vision-language models (VLMs) due to their lack of SAR-specific prior knowledge. To address this, we introduce SARLANG-1M—the first million-scale multimodal SAR benchmark—comprising real-world SAR images collected across 59+ global cities, covering 1,696 object classes and 16 land-cover categories. It features novel hierarchical resolution annotations, multi-granularity semantic descriptions, and cross-task question-answer pair construction. Data curation integrates multi-scale geospatial alignment and fine-grained semantic modeling via expert annotation. SARLANG-1M enables both VLM pretraining and comprehensive evaluation; fine-tuning mainstream VLMs on it yields substantial gains in SAR understanding performance, approaching expert-level accuracy. The dataset and code are publicly released, establishing a new standard for multimodal SAR interpretation.

📝 Abstract
Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. Yet their application to SAR images is severely constrained by the absence of SAR-specific knowledge in their training distributions, leading to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. SARLANG-1M comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs spanning seven applications and 1,012 question types. Extensive experiments on mainstream VLMs demonstrate that fine-tuning with SARLANG-1M significantly enhances their performance in SAR image interpretation, reaching accuracy comparable to human experts. The dataset and code will be made publicly available at https://github.com/Jimmyxichen/SARLANG-1M.
Problem

Research questions and friction points this paper is trying to address.

SAR image interpretation is challenging due to complex physical imaging mechanisms and low visual alignment with human perception.
Current VLMs lack SAR-specific prior knowledge, leading to suboptimal performance on SAR imagery.
No large-scale SAR image-text benchmark exists for adapting and evaluating VLMs on SAR understanding.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale SAR image-text benchmark
Hierarchical resolutions and fine-grained captions
Multi-task question-answering pairs
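To make the benchmark's structure concrete, the sketch below models one SAR image-text QA record with the attributes the abstract describes (per-image resolution from 0.1 to 25 m, captions, and task-typed question-answer pairs) and filters records by task and resolution. The field names and `SarTextPair` class are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one SARLANG-1M-style image-text pair.
# Field names are illustrative; the released dataset may use different keys.
@dataclass
class SarTextPair:
    image_id: str
    resolution_m: float  # ground sample distance; 0.1-25 m in the benchmark
    caption: str         # concise or detailed semantic description
    task: str            # one of the seven QA application types
    question: str
    answer: str

def filter_pairs(pairs: List[SarTextPair], task: str,
                 max_resolution_m: float) -> List[SarTextPair]:
    """Select QA pairs for one task at or below a given resolution."""
    return [p for p in pairs
            if p.task == task and p.resolution_m <= max_resolution_m]

# Two toy records standing in for the million-scale corpus.
pairs = [
    SarTextPair("scene_0001", 0.5, "An urban block with dense buildings.",
                "object counting", "How many large buildings are visible?", "12"),
    SarTextPair("scene_0002", 10.0, "A coastal land-cover scene.",
                "land cover classification", "What is the dominant land cover?", "water"),
]

high_res_counting = filter_pairs(pairs, "object counting", 1.0)
print(len(high_res_counting))  # 1
```

A real fine-tuning pipeline would load such records from the released files and format them as conversation turns for the chosen VLM; the dataclass here only illustrates the kind of multi-granularity metadata the benchmark pairs with each image.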
Authors

Yimin Wei, Fudan University
Aoran Xiao, RIKEN Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, 103-0027, Japan
Yexian Ren, School of Electronics and Information Engineering, Nanjing University of Information Science and Technology, Nanjing, 430079, PR China
Yuting Zhu, School of Electronic and Communication Engineering, Sun Yat-sen University, Guangzhou, 510006, PR China
Hongruixuan Chen, The University of Tokyo, RIKEN
Junshi Xia, RIKEN AIP
Naoto Yokoya, The University of Tokyo, RIKEN