Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective

📅 2025-06-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from high computational overhead due to redundant visual tokens. Prior work compresses tokens only in intermediate layers, overlooking redundancy at the input layer. Method: This paper introduces the first input-stage visual token compression framework for MLLMs, grounded in interpretability. It proposes a lightweight convolutional mapping method that dynamically prunes visual tokens at inference time by modeling attention maps and quantifying token importance—bypassing full-model evaluation and enabling cross-model generalization. Contribution/Results: The approach achieves an average 50% reduction in visual tokens across three mainstream MLLMs while retaining over 96% of the original performance. Crucially, it generalizes to unseen token scales without retraining. To the authors' knowledge, this is the first framework targeting the *input layer* of LLMs that jointly optimizes efficiency, interpretability, and model-agnostic generalizability for visual token compression.

📝 Abstract
Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational cost and inefficiency. Previous works generally assume that all visual tokens are necessary in the shallow layers of LLMs, so token compression typically occurs in intermediate layers. In contrast, our study reveals an interesting insight: with proper selection, token compression is feasible at the input stage of the LLM with negligible performance loss. Specifically, we show that explainability methods can effectively evaluate the importance of each visual token with respect to the given instruction, which can well guide token compression. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass and facilitating practical deployment. Interestingly, this mapping can be learned with a simple, lightweight convolutional network whose training is efficient and independent of the MLLMs. Extensive experiments on 10 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the effectiveness of our approach, e.g., pruning 50% of visual tokens while retaining more than 96% of the original performance across all benchmarks for all three MLLMs. The method also exhibits strong generalization, even when the number of tokens at inference far exceeds that used in training.
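The pipeline the abstract describes (a small convolutional network scores each visual token from the first LLM layer's attention map, then low-scoring tokens are pruned before the rest of the forward pass) can be sketched as below. This is a minimal illustration, not the authors' architecture: the `ImportancePredictor` layer sizes, the attention-map shape, and the `prune_visual_tokens` helper are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class ImportancePredictor(nn.Module):
    """Hypothetical lightweight CNN standing in for the paper's learned
    mapping from first-layer attention maps to explanation scores."""
    def __init__(self, num_heads: int):
        super().__init__()
        # Treat each attention head as one input channel over the
        # sequence of visual tokens; output one score per token.
        self.net = nn.Sequential(
            nn.Conv1d(num_heads, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: [num_heads, num_visual_tokens] -> scores: [num_visual_tokens]
        return self.net(attn.unsqueeze(0)).squeeze(0).squeeze(0)

def prune_visual_tokens(visual_tokens, attn, predictor, keep_ratio=0.5):
    """Keep the top-`keep_ratio` fraction of visual tokens by score."""
    scores = predictor(attn)                            # [N]
    k = max(1, int(keep_ratio * scores.numel()))
    keep = torch.topk(scores, k).indices.sort().values  # preserve order
    return visual_tokens[keep]

# Toy usage with made-up sizes: 8 heads, 64 visual tokens, dim 32.
num_heads, n_tokens, dim = 8, 64, 32
tokens = torch.randn(n_tokens, dim)
attn = torch.rand(num_heads, n_tokens)
kept = prune_visual_tokens(tokens, attn, ImportancePredictor(num_heads))
# With keep_ratio=0.5, half of the 64 tokens remain.
```

Because the predictor sees only an attention map (not the full model state), it can in principle be trained once and applied across MLLMs, which is the deployment property the abstract emphasizes.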
Problem

Research questions and friction points this paper is trying to address.

Reduce computational costs in MLLMs via early token compression
Use explainability methods to guide visual token selection
Learn attention-to-explanation mapping for efficient deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token compression at LLM input stage
Explainability guides visual token selection
Lightweight CNN learns attention-explanation mapping
Lei Lei
University of Science and Technology of China
Jie Gu
Rightly Robotics
Xiaokang Ma
Rightly Robotics
Chu Tang
Rightly Robotics
Jingmin Chen
Alibaba Group
Tong Xu
University of Science and Technology of China