Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

📅 2024-12-02
📈 Citations: 11
Influential: 1
🤖 AI Summary
To address excessive computational overhead caused by visual token redundancy in large vision-language models (LVLMs), this paper proposes VisPruner—a training-free, plug-and-play visual token pruning method. Unlike existing approaches relying on unreliable text–vision cross-attention, VisPruner is the first to leverage purely intrinsic visual structure: it jointly assesses visual self-attention importance and performs cosine-similarity-based clustering for deduplication, thereby synergistically preserving both token importance and diversity. Its modular design enables zero-shot cross-architecture transfer. Evaluated on LLaVA-1.5-7B, VisPruner reduces FLOPs by 91% and inference latency by 75%, while maintaining performance on par with the full model—significantly outperforming cross-modal attention–based pruning baselines.

📝 Abstract
Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that our VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable performance. Our code is available at https://github.com/Theia-4869/VisPruner.
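The two-stage procedure described in the abstract (visual-attention selection, then similarity-based deduplication) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 0.9 cosine-similarity threshold, and the greedy pass over the remaining tokens are all assumptions.

```python
import numpy as np

def vispruner_sketch(features, attn_scores, n_important, n_diverse, sim_thresh=0.9):
    """Hypothetical sketch of two-stage visual token pruning.

    features:    [N, D] visual token embeddings
    attn_scores: [N] visual attention scores (e.g., from the vision encoder)
    Returns the sorted indices of the retained tokens.
    """
    # Stage 1: keep the tokens with the highest visual attention scores.
    order = np.argsort(-attn_scores)
    important = order[:n_important]
    remaining = order[n_important:]

    # Normalize features so dot products are cosine similarities.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)

    # Stage 2: greedily add diverse tokens from the remainder, skipping
    # any token too similar to one already kept (threshold is assumed).
    kept = list(important)
    diverse = []
    for idx in remaining:
        if len(diverse) >= n_diverse:
            break
        if (normed[kept] @ normed[idx]).max() < sim_thresh:
            diverse.append(idx)
            kept.append(idx)
    return np.sort(np.array(kept))
```

With duplicate tokens in the input, the duplicates are dropped in stage 2 while dissimilar tokens survive, so both importance and diversity are preserved.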
Problem

Research questions and friction points this paper is trying to address.

Excessive visual tokens increase computational burden in VLMs
Text-visual attention scores are unreliable for token pruning
Proposing VisPruner for efficient token pruning using visual cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes visual cues for token pruning
Combines visual attention and similarity
Reduces FLOPs and latency significantly
Qizhe Zhang
School of Computer Science, Peking University
Vision Language Model · Computer Vision · Machine Learning
Aosong Cheng
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Ming Lu
Intel Labs China
Zhiyong Zhuo
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Minqi Wang
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Jiajun Cao
National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Shaobo Guo
ByteDance
Qi She
ByteDance
Shanghang Zhang
Peking University
Embodied AI · Foundation Models