Mitigating Information Loss under High Pruning Rates for Efficient Large Vision Language Models

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe performance degradation of Large Vision-Language Models (LVLMs) under high pruning rates, which is primarily caused by the loss of critical visual information, this paper proposes an Adaptive Content Compensation Method (ACCM). ACCM introduces an instruction-guided image-captioning mechanism that reconstructs pruned visual semantics as text. It pairs a lightweight captioning model, which generates question-related descriptions under the guidance of the user instruction, with a selector that identifies the most contextually appropriate caption among multiple candidates. Both modules are trained via self-supervised learning, requiring no human or automated labeling, so salient visual content can be recovered after pruning while computational efficiency is maintained. Evaluated on seven mainstream benchmarks, ACCM consistently outperforms existing pruning approaches, surpassing the state of the art by 20.6% while using 6.5% fewer FLOPs.

📝 Abstract
Despite the great success of Large Vision Language Models (LVLMs), their high computational cost severely limits their broad applications. The computational cost of LVLMs mainly stems from the visual sequence of the input, which consists of hundreds or even thousands of tokens. Although existing methods have made progress by removing redundant tokens, they suffer from severe performance degradation at high pruning rates due to the loss of visual information. In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via image captions. Specifically, ACCM comprises two key components: a lightweight caption model and a selector. First, the caption model generates question-related descriptions under the guidance of the user instruction. Then the selector identifies a contextually appropriate caption from multiple candidates. Leveraging self-supervised learning, our modules can be learned efficiently without any human or automated labeling. We conduct extensive experiments across seven benchmarks, and the results show that ACCM significantly outperforms existing methods with lower FLOPs (e.g., surpassing SOTA by 20.6% with 6.5% fewer FLOPs).
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost in Large Vision Language Models
Mitigating visual information loss during high token pruning
Improving model efficiency without performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Content Compensation Method (ACCM) mitigates visual information loss
Lightweight caption model generates question-related descriptions
Selector efficiently picks the most contextually appropriate caption
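The two-component pipeline described above (caption generation followed by contextual selection) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the real caption model and selector are learned networks, whereas here the candidate captions are given as input and the selector is replaced by a simple word-overlap relevance score.

```python
# Hypothetical sketch of an ACCM-style compensation step (illustrative only).
# After aggressive visual-token pruning, candidate captions stand in for the
# pruned content; a selector keeps the one most relevant to the instruction
# and appends it to the prompt as a textual proxy.
import re


def _words(text):
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_caption(instruction, candidates):
    """Toy selector: pick the candidate caption sharing the most words with
    the user instruction. (In the paper this module is learned via
    self-supervision rather than hand-crafted.)"""
    inst = _words(instruction)
    return max(candidates, key=lambda c: len(inst & _words(c)))


def compensate(instruction, kept_visual_tokens, candidate_captions):
    """Build the compensated input: pruned visual tokens plus the selected
    caption as textual compensation for the discarded visual information."""
    caption = select_caption(instruction, candidate_captions)
    return (f"<{len(kept_visual_tokens)} visual tokens kept> "
            f"Image caption: {caption} Question: {instruction}")
```

For example, given the instruction "What color is the bus?" and candidates describing a bus and a fruit bowl, the toy selector keeps the bus caption, so the question-relevant content survives even if the corresponding image tokens were pruned.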
Mingyu Fu
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Wei Suo
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Ji Ma
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Lin Yuanbo Wu
Swansea University
Computer Vision; AI Generation; Trustworthy AI; Autonomous System; Embodied Visual Intelligence
Peng Wang
Northwestern Polytechnical University, Xi’an, Shaanxi, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi’an, Shaanxi, China
Yanning Zhang
Northwestern Polytechnical University
Computer Vision