VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

📅 2025-12-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Remote sensing multimodal modeling has long suffered from a functional dichotomy between dual-encoder retrieval models, which lack fine-grained spatial reasoning, and generative assistants, which scale poorly for retrieval. This paper introduces VLM2GeoVec, the first unified single-encoder multimodal embedding model that jointly processes images, text, bounding boxes, and geographic coordinates. VLM2GeoVec pioneers an interleaved input architecture that seamlessly integrates cross-modal representation learning with region-level spatial reasoning. The authors further establish RSMEB, a comprehensive remote sensing embedding benchmark covering six fine-grained geovisual tasks. Leveraging contrastive joint embedding, instruction tuning, and explicit geographic coordinate encoding, VLM2GeoVec achieves P@1 scores of 26.6%, 32.5%, and 17.8% on region-description retrieval, referring-expression grounding, and semantic geolocalization, respectively, matching or surpassing task-specific models. This work significantly advances general-purpose multimodal understanding in remote sensing.
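The summary mentions "explicit geographic coordinate encoding" without specifying the scheme. A common choice for feeding coordinates into a neural encoder is multi-frequency sinusoidal features; the sketch below illustrates that idea only, and `encode_latlon` and its parameters are hypothetical, not the paper's method.

```python
# Hypothetical sketch of explicit geographic coordinate encoding.
# The paper states coordinates are model inputs but this page does not give
# the exact scheme; multi-frequency sinusoidal features are one common choice.
import math

def encode_latlon(lat_deg, lon_deg, num_freqs=4):
    """Map (lat, lon) in degrees to a fixed-length sinusoidal feature vector."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    feats = []
    for k in range(num_freqs):
        freq = 2.0 ** k                    # geometric frequency ladder
        for angle in (lat, lon):
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    return feats

vec = encode_latlon(48.8566, 2.3522)       # example: Paris
```

Each frequency band captures location structure at a different spatial scale, and every feature is bounded in [-1, 1], which keeps the encoding well-conditioned as an input to a transformer encoder.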

📝 Abstract
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose **VLM2GeoVec**, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce **RSMEB**, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves **26.6%** P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), **32.5%** P@1 on referring-expression retrieval (+19 pp), and **17.8%** P@1 on semantic geo-localization retrieval (over 3× prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Unifying retrieval and reasoning for remote sensing imagery
Creating a single encoder for interleaved multimodal inputs
Addressing fragmented approaches in remote sensing analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single encoder embeds interleaved multimodal inputs into unified vector space.
Contrastive training eliminates multi-stage pipelines and task-specific modules.
Unifies scalable retrieval with region-level spatial reasoning for remote sensing.
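The contrastive training the abstract and the bullets above describe can be sketched as a symmetric InfoNCE objective over matched query/target embedding pairs. This is a minimal illustration, assuming a standard in-batch-negatives setup; the single encoder itself is elided and its outputs are stood in for by raw arrays, and `info_nce` and its temperature value are illustrative, not the paper's exact recipe.

```python
# Minimal sketch of a symmetric InfoNCE contrastive loss with in-batch
# negatives: matched (interleaved-input, target) embedding pairs sit on the
# diagonal of a similarity matrix; all other entries act as negatives.
import numpy as np

def info_nce(query_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched embedding pairs."""
    # L2-normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = q @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(q))              # positives on the diagonal

    def xent(l):
        # Numerically stable softmax cross-entropy against the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average both retrieval directions (query->target and target->query).
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy usage: identical pairs of random 8-dim embeddings give a low loss.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
loss = info_nce(q, q)
```

Because a single loss over one joint embedding space covers every pairing of modalities, no task-specific head or multi-stage pipeline is needed, which is the design point the Innovation bullets emphasize.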