🤖 AI Summary
To address the severe performance degradation that Large Vision-Language Models (LVLMs) suffer under high pruning ratios, caused primarily by the loss of critical visual information, this paper proposes an Adaptive Content Compensation Method (ACCM). The method introduces an instruction-guided image captioning mechanism that reconstructs lost visual semantics as text: a lightweight caption model generates question-related descriptions under the guidance of the user instruction, and a context-aware selector then identifies the most contextually appropriate caption among multiple candidates. Both modules are trained via self-supervised learning, requiring no human or automated labeling, so salient visual content can be recovered after pruning while computational efficiency is preserved. Evaluated on seven mainstream benchmarks, ACCM consistently outperforms existing pruning approaches, surpassing the previous state of the art by 20.6% while using 6.5% fewer FLOPs.
📝 Abstract
Despite the great success of Large Vision-Language Models (LVLMs), their high computational cost severely limits their broad application. This cost mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer severe performance degradation at high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which effectively mitigates this visual information loss via image captions. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. First, the caption model generates question-related descriptions under the guidance of the user instruction. Then the selector identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules can be trained efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks, and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).
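The two-stage flow the abstract describes (generate instruction-guided caption candidates, then select the most relevant one to compensate for pruned visual tokens) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words similarity stands in for the learned selector, and all function names and toy inputs are hypothetical.

```python
# Illustrative sketch of the ACCM pipeline: after visual-token pruning,
# a caption chosen for its relevance to the user instruction is attached
# as a textual proxy for the discarded visual content.
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the paper uses learned encoders.
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_caption(instruction: str, candidates: list[str]) -> str:
    # Selector stand-in: pick the candidate most similar to the instruction.
    inst = embed(instruction)
    return max(candidates, key=lambda c: cosine(inst, embed(c)))


def compensate(pruned_visual_tokens: list[str], instruction: str,
               candidates: list[str]) -> tuple[list[str], str]:
    # Pair the surviving visual tokens with the selected caption so the
    # LVLM still receives the lost semantics in textual form.
    return pruned_visual_tokens, select_caption(instruction, candidates)


# Hypothetical example: two caption candidates for the same image.
tokens = ["<v0>", "<v1>"]  # visual tokens that survived pruning
candidates = [
    "a red bus parked near a station",
    "two people walking a dog in a park",
]
_, caption = compensate(tokens, "What color is the bus?", candidates)
print(caption)  # → "a red bus parked near a station"
```

In the actual method, the caption model produces the candidates and both modules are trained self-supervised; here the selection criterion is simply lexical overlap with the question, which is enough to show why instruction guidance matters: the same image yields different compensating text for different questions.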