🤖 AI Summary
Multimodal large language models (MLLMs) show weak visual grounding and spatial reasoning in remote sensing (RS), primarily because CLIP-style vision encoders represent RS imagery poorly: they fail to distinguish "CLIP-blind pairs," RS image pairs that are semantically similar yet differ in texture, scale, and viewpoint.
Method: We introduce RSMMVP, the first RS-specific multimodal visual patterns benchmark, comprising four modules: high-resolution RS image construction, CLIP-blind pair identification, similarity discrimination, and RS-oriented visual question answering (RS-VQA).
Contribution/Results: Experiments reveal substantial performance degradation of mainstream MLLMs on RS-VQA tasks. Quantitative analysis localizes CLIP's weaknesses in handling texture, scale, and viewpoint variation in RS imagery, providing empirical evidence and a standardized evaluation protocol for designing RS-customized vision encoders.
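Concretely, the CLIP-blind pair identification step in this style of benchmark reduces to comparing two encoders' similarity scores on the same image pair. Below is a minimal sketch, assuming open_clip for the CLIP encoder and DINOv2 (via torch.hub) as a vision-only reference; the model choices and the 0.95 / 0.6 thresholds are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP encoder (the "blind" model under test).
clip_model, _, clip_pre = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
clip_model = clip_model.to(device).eval()

# DINOv2 as a vision-only reference encoder.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").to(device).eval()
dino_pre = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def pair_similarities(path_a: str, path_b: str) -> tuple[float, float]:
    """Return (CLIP cosine similarity, DINOv2 cosine similarity) for a pair."""
    a = Image.open(path_a).convert("RGB")
    b = Image.open(path_b).convert("RGB")
    clip_a = clip_model.encode_image(clip_pre(a).unsqueeze(0).to(device))
    clip_b = clip_model.encode_image(clip_pre(b).unsqueeze(0).to(device))
    dino_a = dino(dino_pre(a).unsqueeze(0).to(device))
    dino_b = dino(dino_pre(b).unsqueeze(0).to(device))
    return (F.cosine_similarity(clip_a, clip_b).item(),
            F.cosine_similarity(dino_a, dino_b).item())

def is_clip_blind(path_a: str, path_b: str,
                  clip_thr: float = 0.95, dino_thr: float = 0.6) -> bool:
    # "CLIP-blind": CLIP sees the two RS images as near-duplicates while
    # the vision-only encoder still tells them apart.
    clip_sim, dino_sim = pair_similarities(path_a, path_b)
    return clip_sim > clip_thr and dino_sim < dino_thr
```

Pairs flagged this way are then turned into VQA questions that probe exactly the visual difference CLIP misses.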
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, but their remote sensing (RS) counterparts remain relatively underexplored. Unlike natural images, RS imagery presents unique challenges that current MLLMs struggle to handle, particularly in visual grounding and spatial reasoning. This study investigates the limitations of CLIP-based MLLMs in RS, highlighting their failure to differentiate visually distinct yet semantically similar RS images. To address this, we introduce the remote sensing multimodal visual patterns (RSMMVP) benchmark. It evaluates MLLMs on RS tasks by identifying CLIP-blind pairs: visually distinct RS images to which CLIP-based models incorrectly assign high similarity scores. Through a visual question answering (VQA) evaluation, we analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS-specific representation learning. The results provide valuable insights into the weaknesses of CLIP-based visual encoding and a foundation for future research on more effective MLLMs tailored to remote sensing applications.
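For the VQA evaluation, the original MMVP benchmark scores models at the pair level: the same question is posed about both images of a CLIP-blind pair, and credit is given only when both answers are correct. Below is a minimal sketch of that scoring scheme, assuming RSMMVP adopts a similar protocol; `ask_model` is a hypothetical adapter around whatever MLLM is being evaluated.

```python
# Pair-level VQA scoring in the MMVP style; whether RSMMVP uses exactly
# this metric is an assumption, not confirmed by the abstract.
from typing import Callable, Iterable, Tuple

def pair_accuracy(
    pairs: Iterable[Tuple[str, str, str, str, str]],
    ask_model: Callable[[str, str], str],
) -> float:
    """Each item is (image_a, image_b, question, answer_a, answer_b).
    A pair counts as correct only if the model answers the shared
    question correctly for *both* images."""
    correct = total = 0
    for img_a, img_b, question, ans_a, ans_b in pairs:
        total += 1
        pred_a = ask_model(img_a, question).strip().lower()
        pred_b = ask_model(img_b, question).strip().lower()
        if pred_a == ans_a.lower() and pred_b == ans_b.lower():
            correct += 1
    return correct / max(total, 1)
```

Pair-level scoring is deliberately strict: a model that exploits the CLIP encoder's collapsed representation will tend to give the same answer for both images and thus fail the pair, which is what makes the metric diagnostic of the encoder's blindness.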