Bi-Modal Textual Prompt Learning for Vision-Language Models in Remote Sensing

📅 2026-01-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing prompt learning methods struggle to capture dominant semantic cues in remote sensing images and exhibit limited generalization under challenges such as multi-label annotations, high intra-class variability, and diverse spatial resolutions. To address these limitations, this work proposes BiMoRS, the first framework to introduce bi-modal (textual + visual) prompt learning to remote sensing. BiMoRS leverages a frozen BLIP-2 model to generate image-to-text summaries and combines them with CLIP visual features through a lightweight cross-attention mechanism to construct context-aware prompts, enabling efficient adaptation to downstream tasks without fine-tuning the CLIP backbone. Experiments demonstrate that BiMoRS consistently outperforms strong baselines across four remote sensing datasets and three domain generalization benchmarks, with gains of up to 2% on average.

📝 Abstract
Prompt learning (PL) has emerged as an effective strategy to adapt vision-language models (VLMs), such as CLIP, for downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, that hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These captions are tokenized using a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average. Codes are available at https://github.com/ipankhi/BiMoRS.
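The abstract's prompt-construction step (a learnable query prompt conditioned on fused caption and visual features via lightweight cross-attention) can be sketched roughly as follows. This is a minimal single-head illustration with assumed names and dimensions, not the paper's actual implementation; the function `cross_attention_prompt` and all shapes are hypothetical.

```python
import numpy as np

def cross_attention_prompt(query_prompt, fused_feats):
    """Single-head cross-attention sketch (hypothetical, not BiMoRS's code).

    query_prompt: (n_prompt, d) learnable context tokens
    fused_feats:  (n_tokens, d) concatenated caption-token + visual features
    returns:      (n_prompt, d) contextualized prompt tokens
    """
    d = query_prompt.shape[-1]
    scores = query_prompt @ fused_feats.T / np.sqrt(d)      # (n_prompt, n_tokens)
    # numerically stable row-wise softmax over the fused tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ fused_feats                            # attend over fused features

rng = np.random.default_rng(0)
d = 512                                           # embedding width (assumed)
caption_tokens = rng.standard_normal((12, d))     # stands in for tokenized BLIP-2 caption
visual_feats = rng.standard_normal((4, d))        # stands in for high-level CLIP features
fused = np.concatenate([caption_tokens, visual_feats], axis=0)

query_prompt = rng.standard_normal((8, d))        # learnable query prompt (8 tokens, assumed)
context_prompt = cross_attention_prompt(query_prompt, fused)
print(context_prompt.shape)  # (8, 512)
```

Only the query prompt would be trained in such a setup; the caption model and the CLIP backbone stay frozen, which is what keeps the adaptation lightweight.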
Problem

Research questions and friction points this paper is trying to address.

prompt learning
vision-language models
remote sensing
domain generalization
semantic cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

bi-modal prompt learning
vision-language models
remote sensing
domain generalization
frozen captioning model