🤖 AI Summary
This work addresses the high computational cost that multimodal large language models incur from the very large number of visual tokens produced when processing ultra-high-resolution remote sensing imagery. Existing compression methods rely on static strategies and struggle to preserve semantic meaning and geometric structure at the same time. To overcome this limitation, the authors propose DualComp, a framework built around what they present as the first semantic-geometric dual adaptive compression mechanism. Guided dynamically by a lightweight pretrained router, DualComp decouples feature processing into two parallel streams: a Spatially-Contiguous Semantic Aggregator (SCSA) performs size-adaptive clustering to compress background regions, while an Instruction-Guided Structure Recoverer (IGSR) reconstructs spatial skeletons via greedy path tracing. DualComp requires no additional training and selects a compression strategy per task, improving both inference efficiency and accuracy on XLRS-Bench and enabling high-fidelity, low-overhead remote sensing image understanding.
📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive number of visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduces prohibitive computational overhead, severely bottlenecking inference efficiency. Existing visual token compression methods predominantly adopt static, uniform compression strategies, neglecting the inherent "Semantic-Geometric Duality" of remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) uses size-adaptive clustering to aggregate redundant background tokens while protecting small objects. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp achieves high-fidelity remote sensing interpretation at an exceptionally low computational cost, improving efficiency and accuracy simultaneously.
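The dual-stream idea can be illustrated with a toy sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so every name below (`route`, `merge_background`, `trace_path`), the keyword list, the fixed block size, and the thresholds are assumptions made for illustration. Tokens are reduced to scalars on a 2D grid, the keyword match stands in for the pre-trained router, block averaging stands in for SCSA's size-adaptive clustering, and a simple neighbour walk stands in for IGSR's greedy path tracing.

```python
from typing import List, Tuple

Grid = List[List[float]]


def route(instruction: str) -> str:
    """Toy stand-in for the pre-trained router (assumption: keyword match only)."""
    geometric_cues = ("route", "path", "connect", "between", "direction")
    text = instruction.lower()
    return "geometric" if any(cue in text for cue in geometric_cues) else "semantic"


def merge_background(features: Grid, saliency: Grid, thresh: float,
                     block: int = 2) -> List[float]:
    """Semantic stream sketch: within each block x block window, keep salient
    tokens unchanged and average the low-saliency ones into a single token."""
    h, w = len(features), len(features[0])
    out: List[float] = []
    for r0 in range(0, h, block):
        for c0 in range(0, w, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, h))
                     for c in range(c0, min(c0 + block, w))]
            # Salient tokens (e.g. small objects) survive individually.
            out.extend(features[r][c] for r, c in cells if saliency[r][c] >= thresh)
            # Spatially contiguous background collapses to one mean token.
            background = [features[r][c] for r, c in cells if saliency[r][c] < thresh]
            if background:
                out.append(sum(background) / len(background))
    return out


def trace_path(relevance: Grid, thresh: float) -> List[Tuple[int, int]]:
    """Geometric stream sketch: start at the most relevant token, then keep
    stepping to the most relevant unvisited 8-neighbour that is >= thresh."""
    h, w = len(relevance), len(relevance[0])
    start = max(((r, c) for r in range(h) for c in range(w)),
                key=lambda rc: relevance[rc[0]][rc[1]])
    path, visited = [start], {start}
    while True:
        r, c = path[-1]
        candidates = [(r + dr, c + dc)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if (dr or dc)
                      and 0 <= r + dr < h and 0 <= c + dc < w
                      and (r + dr, c + dc) not in visited
                      and relevance[r + dr][c + dc] >= thresh]
        if not candidates:
            return path
        nxt = max(candidates, key=lambda rc: relevance[rc[0]][rc[1]])
        path.append(nxt)
        visited.add(nxt)
```

In this sketch, a counting question such as "How many planes are parked?" would be sent to `merge_background`, while a question about which road connects two villages would be sent to `trace_path`, so the kept tokens follow a connected structure rather than isolated salient patches. A real system would operate on feature vectors and a learned router, not scalars and keywords.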