ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational overhead that excessive visual tokens impose on multimodal large language models (MLLMs), this paper proposes a training-free visual token optimization method. The core innovation is the first formal definition and quantification of *Layer Contribution (LC)*, which precisely identifies network layers that are redundant for visual information processing and enables targeted freezing of visual token updates. Applied to the LLaVA-NeXT architecture, the method freezes visual token updates in approximately 60% of the layers of LLaVA-NeXT-13B, reducing FLOPs by 50% and significantly accelerating inference. Remarkably, it achieves performance superior to the baseline on multiple vision understanding benchmarks. This work establishes a plug-and-play, fine-tuning-free paradigm for efficient MLLM deployment, offering both substantial computational savings and improved accuracy.

📝 Abstract
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
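The abstract's description of LC (divergence in model output when a layer's transformation is removed for chosen tokens) can be sketched numerically. Everything below is an illustrative assumption rather than the paper's exact setup: a toy pooled linear head stands in for the language-model head, `tanh` stands in for a transformer layer, and KL divergence is used as the divergence measure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def layer_contribution(tokens, layer_fn, head_fn, token_mask):
    """LC of one layer w.r.t. the masked tokens: divergence between the
    output distribution with the layer applied everywhere vs. with the
    layer's update skipped on the masked (e.g. visual) tokens."""
    full = layer_fn(tokens)                                # normal layer output
    ablated = np.where(token_mask[:, None], tokens, full)  # masked rows keep pre-layer states
    p = softmax(head_fn(full))
    q = softmax(head_fn(ablated))
    return kl_div(p, q)

# Usage: 6 tokens of dim 8, first 3 treated as visual; toy pooled-logits head.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
W = rng.normal(size=(8, 10))
head = lambda h: h.mean(axis=0) @ W
visual = np.array([True, True, True, False, False, False])

lc_identity = layer_contribution(tokens, lambda x: x, head, visual)  # identity layer: LC is 0
lc_tanh = layer_contribution(tokens, np.tanh, head, visual)          # real transform: LC > 0
```

A layer whose LC for visual tokens is near zero changes the model output almost not at all when its visual-token updates are skipped, which is exactly the redundancy ShortV exploits.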
Problem

Research questions and friction points this paper is trying to address.

High computational costs in Multimodal Large Language Models, driven largely by the number of visual tokens
How to identify layers that are ineffective for visual token processing
Whether visual token updates can be frozen in redundant layers without hurting performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Freezes visual token updates in ineffective layers
Introduces the Layer Contribution (LC) metric to quantify layer-wise redundancy
Reduces FLOPs by 50% with no performance loss
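The freezing mechanism above can be sketched as a forward pass in which layers flagged as ineffective carry visual hidden states through unchanged while text tokens are still updated. The per-layer functions and the frozen-layer set below are toy assumptions; the real method operates inside the transformer layers of LLaVA-NeXT.

```python
import numpy as np

def forward_with_shortv(hidden, layers, frozen_layers, visual_mask):
    """Toy ShortV-style forward: in layers listed in frozen_layers, the
    rows selected by visual_mask keep their previous hidden states."""
    h = hidden.copy()
    for i, layer in enumerate(layers):
        out = layer(h)
        if i in frozen_layers:
            # freeze visual rows: keep their pre-layer hidden states
            h = np.where(visual_mask[:, None], h, out)
        else:
            h = out
    return h

# Usage: 4 tokens, first 2 visual; both toy layers frozen for visual tokens.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 5))
layers = [np.tanh, lambda z: z * 0.5 + 1.0]
visual = np.array([True, True, False, False])
y = forward_with_shortv(x, layers, frozen_layers={0, 1}, visual_mask=visual)
```

Because frozen layers never recompute the visual rows, the per-layer FLOPs spent on visual tokens in those layers are saved, which is the source of the reported 50% FLOPs reduction when roughly 60% of layers are frozen.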
👥 Authors
Qianhao Yuan
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Qingyu Zhang
Institute of Software, Chinese Academy of Sciences
Yanjiang Liu
University of Chinese Academy of Sciences
Jiawei Chen
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yaojie Lu
Institute of Software, Chinese Academy of Sciences
Hongyu Lin
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Jia Zheng
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Xianpei Han
Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences
Le Sun
Institute of Software, Chinese Academy of Sciences