ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead that excessive visual tokens impose on multimodal large language models (MLLMs), this paper proposes a training-free visual token optimization method. The core innovation is the first formal definition and quantification of *Layer Contribution (LC)*, which precisely identifies network layers that are redundant for visual information processing and enables targeted freezing of visual token updates. Applied to the LLaVA-NeXT architecture, the method freezes visual token updates in approximately 60% of the layers of LLaVA-NeXT-13B, reducing FLOPs by 50% and significantly accelerating inference. Remarkably, it achieves performance superior to the baseline on multiple vision understanding benchmarks. This work establishes a plug-and-play, fine-tuning-free paradigm for efficient MLLM deployment, offering both substantial computational savings and improved accuracy.

📝 Abstract
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
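The abstract's description of LC (divergence in model output when a layer's transformation is removed for chosen tokens) can be sketched numerically. Everything below is an illustrative assumption rather than the paper's exact setup: a toy pooled linear head stands in for the language-model head, `tanh` stands in for a transformer layer, and KL divergence is used as the divergence measure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def layer_contribution(tokens, layer_fn, head_fn, token_mask):
    """LC of one layer w.r.t. the masked tokens: divergence between the
    output distribution with the layer applied everywhere vs. with the
    layer's update skipped on the masked (e.g. visual) tokens."""
    full = layer_fn(tokens)                                # normal layer output
    ablated = np.where(token_mask[:, None], tokens, full)  # masked rows keep pre-layer states
    p = softmax(head_fn(full))
    q = softmax(head_fn(ablated))
    return kl_div(p, q)

# Usage: 6 tokens of dim 8, first 3 treated as visual; toy pooled-logits head.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
W = rng.normal(size=(8, 10))
head = lambda h: h.mean(axis=0) @ W
visual = np.array([True, True, True, False, False, False])

lc_identity = layer_contribution(tokens, lambda x: x, head, visual)  # identity layer: LC is 0
lc_tanh = layer_contribution(tokens, np.tanh, head, visual)          # real transform: LC > 0
```

A layer whose LC for visual tokens is near zero changes the model output almost not at all when its visual-token updates are skipped, which is exactly the redundancy ShortV exploits.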
Problem

Research questions and friction points this paper is trying to address.

High computational costs in Multimodal Large Language Models, driven largely by the number of visual tokens
How to identify layers that are ineffective for visual token processing
Whether visual token updates can be frozen in redundant layers without hurting performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Freezes visual token updates in ineffective layers
Introduces the Layer Contribution (LC) metric to quantify layer-wise redundancy
Reduces FLOPs by 50% with no performance loss
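The freezing mechanism above can be sketched as a forward pass in which layers flagged as ineffective carry visual hidden states through unchanged while text tokens are still updated. The per-layer functions and the frozen-layer set below are toy assumptions; the real method operates inside the transformer layers of LLaVA-NeXT.

```python
import numpy as np

def forward_with_shortv(hidden, layers, frozen_layers, visual_mask):
    """Toy ShortV-style forward: in layers listed in frozen_layers, the
    rows selected by visual_mask keep their previous hidden states."""
    h = hidden.copy()
    for i, layer in enumerate(layers):
        out = layer(h)
        if i in frozen_layers:
            # freeze visual rows: keep their pre-layer hidden states
            h = np.where(visual_mask[:, None], h, out)
        else:
            h = out
    return h

# Usage: 4 tokens, first 2 visual; both toy layers frozen for visual tokens.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 5))
layers = [np.tanh, lambda z: z * 0.5 + 1.0]
visual = np.array([True, True, False, False])
y = forward_with_shortv(x, layers, frozen_layers={0, 1}, visual_mask=visual)
```

Because frozen layers never recompute the visual rows, the per-layer FLOPs spent on visual tokens in those layers are saved, which is the source of the reported 50% FLOPs reduction when roughly 60% of layers are frozen.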
👥 Authors
Qianhao Yuan
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Qingyu Zhang
Institute of Software, Chinese Academy of Sciences
Yanjiang Liu
University of Chinese Academy of Sciences
Jiawei Chen
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Jia Zheng
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Le Sun
Institute of Software, Chinese Academy of Sciences