Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

📅 2025-11-23
🤖 AI Summary
To address the deployment inefficiency caused by the escalating scale of vision-language models (VLMs) and the lack of theoretical guarantees on semantic information preservation in existing compression methods, this paper proposes InfoPrune, an adaptive structured pruning framework grounded in the information bottleneck principle. Its core innovation is to jointly model an entropy-based effective rank (eRank) and the Kolmogorov–Smirnov (KS) distance to quantify the information contribution of each attention head, yielding a unified pruning criterion that enforces structural sparsity while preserving semantic fidelity. InfoPrune supports both training-based head pruning and training-free FFN compression. Evaluated on VQAv2, TextVQA, and GQA, it achieves up to a 3.2× FLOPs reduction and a 1.8× inference speedup with less than 0.5% accuracy degradation, substantially outperforming state-of-the-art baselines.

📝 Abstract
Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov–Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2× FLOPs reduction and 1.8× acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
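The two quantities underlying InfoPrune's pruning criterion can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes the eRank is the exponential of the Shannon entropy of normalized singular values (the standard effective-rank definition) and that the KS distance is the usual two-sample statistic over flattened activations; how the paper combines and weights them is not specified here.

```python
import numpy as np

def erank(W):
    """Entropy-based effective rank: exp of the Shannon entropy
    of the normalized singular-value distribution of W."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                       # drop exact zeros before log
    return float(np.exp(-(p * np.log(p)).sum()))

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic between the empirical
    CDFs of two (flattened) activation samples."""
    a, b = np.sort(np.ravel(a)), np.sort(np.ravel(b))
    xs = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, xs, side="right") / a.size
    cdf_b = np.searchsorted(b, xs, side="right") / b.size
    return float(np.abs(cdf_a - cdf_b).max())

# Toy comparison: a rank-4 head map vs. a full-rank one.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 64))
full = rng.normal(size=(64, 64))
# Lower eRank signals a more redundant (hence more prunable) head.
print(erank(low_rank), erank(full))
```

In this reading, heads whose output maps have a low eRank contribute few independent directions, while a small KS distance between original and pruned activations indicates the compressed structure preserves the activation distribution.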
Problem

Research questions and friction points this paper is trying to address.

Compressing large vision-language models to reduce computational costs and improve deployment efficiency
Developing theoretically grounded pruning methods instead of heuristic approaches for model compression
Preserving task-relevant semantic information while removing redundant model components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Information-theoretic framework for adaptive structural compression
Entropy-based effective rank and KS distance for attention pruning
Training-free FFN compression via adaptive low-rank approximation
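The training-free FFN compression can be illustrated with a truncated-SVD sketch. This is an assumption-laden toy, not the paper's method: the "adaptive" rank here is simply the smallest rank retaining a chosen fraction of spectral energy, whereas the paper presumably adapts the rank via its information-theoretic criterion.

```python
import numpy as np

def lowrank_ffn(W, energy=0.95):
    """Factor W (out_dim x in_dim) as A @ B with the smallest rank r
    whose singular values retain `energy` of the total spectral energy.
    Replacing one matmul with two rank-r matmuls saves compute when
    r * (out_dim + in_dim) < out_dim * in_dim."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(cum, energy)) + 1
    A = U[:, :r] * s[:r]               # (out_dim, r), columns scaled by s
    B = Vt[:r]                         # (r, in_dim)
    return A, B, r

# Demo on a weight matrix with intrinsic rank 8.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8)) @ rng.normal(size=(8, 128))
A, B, r = lowrank_ffn(W, energy=0.999)
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"kept rank {r} of {min(W.shape)}, relative error {err:.2e}")
```

Because the factors are obtained directly from the pretrained weights, no gradient updates are needed, which is what makes this branch of the framework training-free.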
Zhaoqi Xu
School of Artificial Intelligence, Beijing Normal University
Yingying Zhang
Zhongtai Securities Institute for Financial Studies, Shandong University
Jian Li
School of Artificial Intelligence, Beijing Normal University
Jianwei Guo
Beijing Normal University
computer graphics, geometric processing
Qiannan Zhu
School of Artificial Intelligence, Beijing Normal University
knowledge graph, recommendation system, information retrieval
Hua Huang
School of Artificial Intelligence, Beijing Normal University