ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models

📅 2025-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high computational overhead caused by visual token redundancy in large vision-language models (LVLMs), this paper proposes ToDRE, a training-free, two-stage visual token compression framework. It combines diversity-based selection with task-relevance-based pruning. Stage I applies a greedy k-center algorithm after the vision encoder to retain a small, diverse subset of visual tokens. Stage II prunes task-irrelevant visual tokens inside the LLM decoder, while remaining compatible with efficient attention operators such as FlashAttention. The reported experiments show a 90% reduction in visual tokens after the vision encoder, a 2.6× speed-up in total inference time, and 95.1% of the original model performance retained. The core contribution is to treat token diversity and task relevance, rather than token importance alone, as the pruning criteria, which allows adaptive visual token compression without any fine-tuning.

📝 Abstract

The representation of visual inputs of large vision-language models (LVLMs) usually involves substantially more tokens than that of textual inputs, leading to significant computational overhead. Several recent studies strive to mitigate this issue by either conducting token compression to prune redundant visual tokens or guiding them to bypass certain computational stages. While most existing work exploits token importance as the redundancy indicator, our study reveals that two largely neglected factors, namely, the diversity of retained visual tokens and their task relevance, often offer more robust criteria in token pruning. To this end, we design ToDRE, a two-stage and training-free token compression framework that achieves superior performance by pruning Tokens based on token Diversity and token-task RElevance. Instead of pruning redundant tokens, ToDRE introduces a greedy k-center algorithm to select and retain a small subset of diverse visual tokens after the vision encoder. Additionally, ToDRE addresses "information migration" by further eliminating task-irrelevant visual tokens within the decoder of the large language model (LLM). Extensive experiments show that ToDRE effectively reduces 90% of visual tokens after the vision encoder and adaptively prunes all visual tokens within certain LLM decoder layers, leading to a 2.6× speed-up in total inference time while maintaining 95.1% of model performance and excellent compatibility with efficient attention operators.
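
To make the diversity-based selection step more concrete, here is a minimal sketch of a generic greedy k-center (farthest-point) selection over token features. It is an illustration, not the authors' implementation: the function name `greedy_k_center`, the use of Euclidean distance, and the choice of the first token as the starting center are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def greedy_k_center(features: np.ndarray, k: int, start: int = 0) -> list[int]:
    """Pick k token indices so that each new pick is the token farthest
    from all tokens picked so far (generic farthest-point heuristic).

    features: (num_tokens, dim) array of visual token embeddings.
    k:        number of tokens to keep.
    start:    index of the first center (assumed; any token could be used).
    """
    num_tokens = features.shape[0]
    if not 0 < k <= num_tokens:
        raise ValueError("k must be between 1 and the number of tokens")

    selected = [start]
    # Distance from every token to its nearest selected center so far.
    # Euclidean distance is an assumption made for this sketch.
    min_dist = np.linalg.norm(features - features[start], axis=1)

    while len(selected) < k:
        # The token least well covered by the current centers.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(576, 64))  # hypothetical token count and width
    keep = greedy_k_center(tokens, k=58)  # roughly 10% of the tokens
    print(len(keep), len(set(keep)))
```

Because each step picks the token with the largest distance to the current set, near-duplicate tokens are unlikely to be chosen together, which is the intuition behind using diversity rather than importance scores as the retention criterion.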

Problem

Research questions and friction points this paper is trying to address.

Reduce computational overhead in large vision-language models

Prune redundant visual tokens via diversity and task relevance

Maintain model performance while speeding up inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token pruning via diversity and task relevance

Greedy k-center algorithm for token selection

Adaptive pruning in LLM decoder layers
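
One way to picture adaptive pruning inside the decoder is sketched below. It is a hypothetical illustration only: the abstract does not say how task relevance is measured, so the relevance signal used here (the share of one query token's attention that lands on visual tokens), the threshold value, and the names `visual_attention_share` and `first_prunable_layer` are all assumptions rather than the paper's procedure.

```python
import numpy as np


def visual_attention_share(attn_row: np.ndarray, visual_mask: np.ndarray) -> float:
    """Fraction of one query's attention mass that falls on visual tokens.

    attn_row:    (seq_len,) non-negative attention weights for a single query.
    visual_mask: (seq_len,) boolean array, True at visual token positions.
    """
    total = attn_row.sum()
    return float(attn_row[visual_mask].sum() / total)


def first_prunable_layer(attn_rows, visual_mask, threshold=0.05):
    """Return the first layer index whose visual attention share drops
    below `threshold`, or None if no layer does.

    attn_rows: iterable of (seq_len,) arrays, one per decoder layer.
    threshold: hypothetical cut-off chosen for this sketch.
    """
    for layer_idx, row in enumerate(attn_rows):
        if visual_attention_share(row, visual_mask) < threshold:
            return layer_idx
    return None


if __name__ == "__main__":
    mask = np.array([True, True, False, False])
    layers = [
        np.array([0.40, 0.30, 0.20, 0.10]),  # visual tokens still attended
        np.array([0.10, 0.10, 0.40, 0.40]),
        np.array([0.01, 0.01, 0.49, 0.49]),  # attention has moved to text
    ]
    print(first_prunable_layer(layers, mask))
```

In this toy setup, every visual token would be dropped from the returned layer onward, mirroring the abstract's statement that all visual tokens are pruned within certain decoder layers. How the actual method decides which layers qualify is not described in this summary.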
