GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing MLLM fine-tuning approaches for remote sensing imagery are often hindered by background noise or neglect fine-grained details, struggling with challenges such as large-scale variations, sparse targets, and complex regional semantics. This work proposes GRASP, a parameter-efficient fine-tuning strategy that aligns spatially structured soft prompts with spatial blocks of a frozen visual token grid and introduces a question-guided sparse fusion mechanism. This mechanism dynamically aggregates task-relevant context into a compact global prompt that emphasizes critical regions while suppressing distractions. By integrating region-aware sparse prompting with efficient context aggregation, GRASP achieves competitive performance against existing fine-tuning and prompting methods across multiple RSVQA benchmarks while maintaining high parameter efficiency.

📝 Abstract
In recent years, Multimodal Large Language Models (MLLMs) have made significant progress in visual question answering tasks. However, directly applying existing fine-tuning methods to remote sensing (RS) images often leads to issues such as overfitting on background noise or neglecting target details. This is primarily due to the large-scale variations, sparse target distributions, and complex regional semantic features inherent in RS images. These challenges limit the effectiveness of MLLMs in RS tasks. To address these challenges, we propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Through a question-guided sparse fusion mechanism, GRASP dynamically aggregates task-specific context into a compact global prompt, enabling the model to focus on relevant regions while filtering out background noise. Extensive experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods while maintaining high parameter efficiency.
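The mechanism described in the abstract, region-aware soft prompts attached to spatial blocks of a frozen visual token grid, then question-guided sparse fusion into one compact global prompt, can be sketched roughly as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function name `grasp_prompt`, the mean-pooled block summaries, the additive prompt conditioning, and the top-k softmax fusion are all assumptions made for clarity.

```python
import numpy as np

def grasp_prompt(visual_tokens, region_prompts, question_vec, top_k=4):
    """Hypothetical sketch of GRASP-style question-guided sparse fusion.

    visual_tokens : (num_blocks, block_len, d) frozen visual tokens,
                    grouped into spatial blocks
    region_prompts: (num_blocks, d) learnable soft prompts, one per block
    question_vec  : (d,) pooled question embedding
    Returns a compact global prompt of shape (d,).
    """
    # Summarize each spatial block by mean-pooling its frozen tokens.
    block_feats = visual_tokens.mean(axis=1)           # (num_blocks, d)

    # Condition each block summary on its region-aware soft prompt
    # (additive conditioning is an assumption of this sketch).
    conditioned = block_feats + region_prompts         # (num_blocks, d)

    # Score each block against the question (scaled dot product).
    d = question_vec.shape[-1]
    scores = conditioned @ question_vec / np.sqrt(d)   # (num_blocks,)

    # Sparse fusion: keep only the top-k most question-relevant blocks,
    # filtering out background regions entirely.
    keep = np.argsort(scores)[-top_k:]
    weights = np.exp(scores[keep] - scores[keep].max())
    weights /= weights.sum()

    # Aggregate the selected blocks into one compact global prompt.
    return weights @ conditioned[keep]                 # (d,)
```

Because only the `top_k` selected blocks contribute to the global prompt, the remaining blocks, which in RS imagery are often background, are dropped outright rather than merely down-weighted, which matches the abstract's goal of filtering background noise while keeping the trainable parameter count small (only the region prompts are learned; the visual tokens stay frozen).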
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Remote Sensing
Visual Question Answering
Sparse Target Distribution
Regional Semantic Complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-efficient fine-tuning
region-aware prompting
sparse fusion
multimodal large language models
remote sensing
Qigan Sun
School of Computing, Kyung Hee University, Yongin-si, South Korea
Chaoning Zhang
Professor at UESTC (University of Electronic Science and Technology of China)
Computer Vision · LLM and VLM · GenAI and AIGC Detection
Jianwei Zhang
Professor, School of Education, University at Albany, SUNY
CSCL · learning sciences · technology for creativity · knowledge building · inquiry-based learning
Xudong Wang
School of Computing, Kyung Hee University, Yongin-si, South Korea
Jiehui Xie
School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
Pengcheng Zheng
School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
Haoyu Wang
College of Computer Science and Information Engineering, Harbin Normal University, Harbin, 150025, China
Sungyoung Lee
Computer Science and Engineering, Kyung Hee University
Artificial Intelligence · Big Data · Knowledge Base · Healthcare Platform
Chi-lok Andy Tai
College of Professional and Continuing Education, The Hong Kong Polytechnic University, Hong Kong, China
Yang Yang
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
Heng Tao Shen
School of Computer Science and Technology, Tongji University, Shanghai, China