Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

📅 2026-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of the massive visual token sequences that multimodal large language models generate when processing ultra-high-resolution remote sensing imagery. Existing compression methods rely on static, uniform strategies and struggle to preserve semantic meaning and geometric structure at the same time. The authors propose DualComp, a task-adaptive dual-stream token compression framework. Guided dynamically by a lightweight pretrained router, DualComp decouples feature processing into two pathways: in the object semantic stream, a Spatially-Contiguous Semantic Aggregator (SCSA) uses size-adaptive clustering to aggregate redundant background while protecting small objects; in the scene geometric stream, an Instruction-Guided Structure Recoverer (IGSR) reconstructs spatial skeletons via greedy path-tracing topology completion. DualComp requires no additional training and selects its compression strategy per task; on XLRS-Bench it improves both inference efficiency and accuracy.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent "Semantic-Geometric Duality" in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregate redundant background while protecting small objects. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.
Problem

Research questions and friction points this paper is trying to address.

Ultra-High-Resolution
Visual Token Compression
Semantic-Geometric Duality
Remote Sensing Understanding
Computational Overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Geometric Duality
Visual Token Compression
Task-Adaptive Framework
Ultra-High-Resolution Remote Sensing
Training-Free Compression
Authors

Yueying Li
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Fengxiang Wang
National University of Defense Technology
Computer Vision, Remote Sensing
Yan Li
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Mingshuo Chen
School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
Mengying Zhao
Shandong University
embedded system
Long Lan
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China