Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe performance degradation of Large Vision-Language Models (LVLMs) under high pruning rates, which is primarily caused by the loss of critical visual information, this paper proposes an Adaptive Content Compensation Method (ACCM). ACCM introduces an instruction-guided image-captioning mechanism that reconstructs pruned visual semantics as text. It pairs a lightweight captioning model, which generates question-related descriptions under the guidance of the user instruction, with a selector that identifies the most contextually appropriate caption among multiple candidates. Both modules are trained via self-supervised learning, requiring no human or automated labeling, so salient visual content can be recovered after pruning while computational efficiency is maintained. Evaluated on seven mainstream benchmarks, ACCM consistently outperforms existing pruning approaches, surpassing the state of the art by 20.6% while using 6.5% fewer FLOPs.

📝 Abstract
Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation at high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via image captions. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. First, the caption model generates question-related descriptions under the guidance of the user instruction. Then the selector identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules can be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks, and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost in Large Vision Language Models
Mitigating visual information loss during high token pruning
Improving model efficiency without performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Content Compensation Method (ACCM) mitigates visual information loss
Lightweight caption model generates question-related descriptions
Selector efficiently picks the most contextually appropriate caption
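The two-component pipeline described above (caption generation followed by contextual selection) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the real caption model and selector are learned networks, whereas here the candidate captions are given as input and the selector is replaced by a simple word-overlap relevance score.

```python
# Hypothetical sketch of an ACCM-style compensation step (illustrative only).
# After aggressive visual-token pruning, candidate captions stand in for the
# pruned content; a selector keeps the one most relevant to the instruction
# and appends it to the prompt as a textual proxy.
import re


def _words(text):
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_caption(instruction, candidates):
    """Toy selector: pick the candidate caption sharing the most words with
    the user instruction. (In the paper this module is learned via
    self-supervision rather than hand-crafted.)"""
    inst = _words(instruction)
    return max(candidates, key=lambda c: len(inst & _words(c)))


def compensate(instruction, kept_visual_tokens, candidate_captions):
    """Build the compensated input: pruned visual tokens plus the selected
    caption as textual compensation for the discarded visual information."""
    caption = select_caption(instruction, candidate_captions)
    return (f"<{len(kept_visual_tokens)} visual tokens kept> "
            f"Image caption: {caption} Question: {instruction}")
```

For example, given the instruction "What color is the bus?" and candidates describing a bus and a fruit bowl, the toy selector keeps the bus caption, so the question-relevant content survives even if the corresponding image tokens were pruned.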
Mingyu Fu
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Wei Suo
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Ji Ma
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Lin Yuanbo Wu
Swansea University
Computer Vision; AI Generation; Trustworthy AI; Autonomous System; Embodied Visual Intelligence
Peng Wang
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Yanning Zhang
Northwestern Polytechnical University
Computer Vision