Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive computational and memory overhead of deploying multimodal large language models (MLLMs), this paper proposes a compression framework that pairs structured pruning with data-efficient recovery training. Methodologically, it systematically compares layerwise and widthwise structured pruning of the language model backbone, a direction little explored in MLLM compression, and introduces a lightweight recovery mechanism combining supervised fine-tuning with hidden-state knowledge distillation using as little as 5% of the original training data; at small compression levels, fine-tuning only the multimodal projector suffices. A key empirical finding is that widthwise pruning outperforms layerwise (depth) pruning in resource-constrained settings. Experiments on LLaVA-v1.5-7B and Bunny-v1.0-3B show that the compressed models retain over 95% of the original task performance while reducing inference latency and GPU memory footprint. Overall, the work offers a scalable, data-efficient, and practically deployable compression recipe for MLLMs.

📝 Abstract
While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms--layerwise and widthwise pruning--applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, this study offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computational resources or abundant finetuning data.
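The widthwise pruning the abstract describes can be illustrated with a small sketch. This is a hypothetical, magnitude-based example of removing intermediate channels from one transformer MLP pair; the function name, scoring rule, and weight layout are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def prune_mlp_width(W_up, W_down, keep_ratio):
    """Widthwise prune one MLP pair: keep the highest-scoring
    intermediate channels (rows of W_up / columns of W_down).

    W_up:   (hidden, d_model)  up-projection weights
    W_down: (d_model, hidden)  down-projection weights
    """
    hidden = W_up.shape[0]
    keep = max(1, int(hidden * keep_ratio))
    # Score each intermediate channel by its weight magnitude in both
    # projections (one common structured-pruning criterion; illustrative).
    scores = np.linalg.norm(W_up, axis=1) + np.linalg.norm(W_down, axis=0)
    idx = np.sort(np.argsort(scores)[-keep:])  # channels to keep, in order
    return W_up[idx], W_down[:, idx]

rng = np.random.default_rng(0)
W_up, W_down = rng.normal(size=(16, 8)), rng.normal(size=(8, 16))
W_up_p, W_down_p = prune_mlp_width(W_up, W_down, keep_ratio=0.5)
# The block's input/output dimensionality (8) is unchanged; only the
# intermediate width shrinks from 16 to 8, cutting compute and memory.
```

Because the block's input and output dimensions are preserved, a pruned block drops back into the model without touching surrounding layers, which is part of why the recovery training afterwards can be so light.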
Problem

Research questions and friction points this paper is trying to address.

Compressing existing MLLMs directly instead of retraining them from small language models
Identifying which structural pruning and recovery strategies hold up under limited compute and finetuning data
Cutting computational and memory requirements while preserving task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layerwise and widthwise structural pruning applied directly to the MLLM language backbone
Data-efficient recovery (supervised finetuning plus hidden-state distillation) with as little as 5% of the original training data
Evidence that widthwise pruning is the better choice in low-resource settings
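The recovery objective described above, supervised finetuning combined with hidden-state distillation, can be sketched as a single loss. The weighting scheme and exact formulation here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def recovery_loss(student_logits, labels, student_hidden, teacher_hidden, lam=1.0):
    """Combined recovery loss: SFT cross-entropy on next-token labels
    plus MSE between pruned-model and original-model hidden states.
    (Hypothetical sketch; `lam` is an assumed balancing weight.)"""
    # Numerically stable log-softmax for the cross-entropy term.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Hidden-state distillation: match the unpruned teacher's states.
    mse = np.mean((student_hidden - teacher_hidden) ** 2)
    return ce + lam * mse

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 32))   # 4 tokens, vocabulary of 32
labels = rng.integers(0, 32, size=4)
s_h = rng.normal(size=(4, 8))       # pruned-model (student) hidden states
t_h = rng.normal(size=(4, 8))       # original-model (teacher) hidden states
loss = recovery_loss(logits, labels, s_h, t_h)
```

The distillation term gives the pruned model a dense training signal from the original model's internals, which helps explain why the paper finds recovery feasible with only a small fraction of the training data.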