🤖 AI Summary
Existing MLLM fine-tuning approaches often overfit to background noise or miss fine-grained target details when applied to remote sensing imagery, which combines large variations in scale, sparse targets, and complex regional semantics. This work proposes GRASP (Guided Region-Aware Sparse Prompting), a parameter-efficient fine-tuning strategy that associates spatially structured soft prompts with spatial blocks of a frozen visual token grid and adds a question-guided sparse fusion mechanism. The mechanism dynamically aggregates task-relevant context into a compact global prompt that emphasizes relevant regions and filters out background noise. On multiple RSVQA benchmarks, GRASP achieves competitive performance compared with existing fine-tuning and prompt-based methods while maintaining high parameter efficiency.
📝 Abstract
In recent years, Multimodal Large Language Models (MLLMs) have made significant progress on visual question answering tasks. However, directly applying existing fine-tuning methods to remote sensing (RS) images often leads to overfitting on background noise or neglect of target details. This is primarily due to the large variations in scale, sparse target distributions, and complex regional semantics inherent in RS images, which limit the effectiveness of MLLMs on RS tasks. To address these challenges, we propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Through a question-guided sparse fusion mechanism, GRASP dynamically aggregates task-specific context into a compact global prompt, enabling the model to focus on relevant regions while filtering out background noise. Extensive experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared with existing fine-tuning and prompt-based methods while maintaining high parameter efficiency.
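The idea of fusing block-level prompts under question guidance can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how blocks are formed, how relevance is scored, or how sparsity is enforced, so the average pooling, the scaled dot-product scoring, the top-k softmax, and the names `block_pool`, `sparse_fuse`, and `top_k` are all assumptions made for illustration. In an actual PEFT setup the block prompts would be learnable parameters and the visual backbone would stay frozen; random arrays stand in for both here.

```python
import numpy as np


def block_pool(tokens, grid_hw, block_hw):
    """Average-pool a (H*W, D) visual token grid into (num_blocks, D) block features.

    Assumption: blocks are non-overlapping and H, W are divisible by the block size.
    """
    H, W = grid_hw
    bh, bw = block_hw
    D = tokens.shape[-1]
    grid = tokens.reshape(H // bh, bh, W // bw, bw, D)
    return grid.mean(axis=(1, 3)).reshape(-1, D)


def sparse_fuse(question_vec, block_feats, block_prompts, top_k):
    """Fuse per-block prompts (B, P, D) into one global prompt (P, D).

    Assumption: relevance is a scaled dot product between the question
    embedding and each block feature, and only the top_k blocks receive
    non-zero weight (softmax over the kept scores).
    """
    scores = block_feats @ question_vec / np.sqrt(question_vec.shape[0])
    keep = np.argsort(scores)[-top_k:]           # indices of the k highest scores
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    weights = np.zeros_like(scores)
    weights[keep] = w                            # all other blocks stay at zero
    global_prompt = np.tensordot(weights, block_prompts, axes=1)
    return global_prompt, weights


# Toy usage with random stand-ins for frozen features and learnable prompts.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16 * 16, 32))          # 16x16 token grid, D = 32
block_feats = block_pool(tokens, (16, 16), (4, 4))   # 16 blocks
block_prompts = rng.normal(size=(16, 4, 32))     # 4 prompt tokens per block
question = rng.normal(size=32)

global_prompt, weights = sparse_fuse(question, block_feats, block_prompts, top_k=3)
print(global_prompt.shape, np.count_nonzero(weights))
```

Under these assumptions, the sparsity comes entirely from the top-k selection: blocks outside the selected set contribute nothing to the global prompt, which is one simple way to model "filtering out background noise".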