PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models suffer from high inference latency and excessive GPU memory consumption due to redundant visual tokens. To address this, we propose an early-layer joint token pruning and clustering framework. First, we design a lightweight, attention-free token importance metric for efficient saliency estimation. Second, we introduce Distance-Bounded Density Peak Clustering (DBDPC), a novel clustering algorithm that enhances geometric compactness while preserving semantic consistency. Third, we develop a FlashAttention-compatible differentiable pruning mechanism, enabling end-to-end training and seamless deployment. Extensive experiments demonstrate that our method retains over 99% of the original performance across multiple vision-language tasks, while significantly reducing inference latency and GPU memory usage. This work establishes a new paradigm for efficient vision-language modeling through structured visual token compression at early transformer layers.

📝 Abstract
Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.
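The abstract describes Distance Bounded Density Peak Clustering only at a high level: tokens are grouped so that the distance between elements within a cluster stays under a predefined threshold. A minimal sketch of one plausible reading (the function name, the density estimate, and the greedy assignment rule are assumptions for illustration, not the paper's implementation) might look like:

```python
import numpy as np

def dbdpc(tokens: np.ndarray, dist_threshold: float) -> np.ndarray:
    """Toy distance-bounded density-peak clustering sketch.

    tokens: (N, D) array of visual token embeddings.
    dist_threshold: maximum allowed distance between a token and
        its cluster center (the "distance bound").
    Returns an (N,) array mapping each token to a center index.
    """
    # Pairwise Euclidean distances between all tokens.
    dists = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)

    # Local density: number of neighbors within the distance bound
    # (a common density estimate in density-peak clustering).
    density = (dists < dist_threshold).sum(axis=1)

    # Greedily pick cluster centers in decreasing-density order,
    # skipping tokens already claimed by an existing center.
    order = np.argsort(-density)
    labels = np.full(len(tokens), -1, dtype=int)
    for i in order:
        if labels[i] != -1:
            continue
        # Token i becomes a center; it claims every unassigned token
        # within the bound, so each member's distance to its center
        # is guaranteed to be at most dist_threshold.
        members = (dists[i] <= dist_threshold) & (labels == -1)
        labels[members] = i
    return labels
```

In a token-reduction setting, the members of each cluster would then be merged (e.g., averaged) into a single visual token, while the distance bound keeps merged tokens visually similar.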
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant visual tokens in language models
Improves inference speed and memory efficiency
Introduces novel clustering for token merging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes irrelevant tokens early in model
Uses novel importance metric without attention
Clusters tokens with distance-bounded algorithm
M. Dhouib
LIX, École Polytechnique, IP Paris, France
Davide Buscaldi
Associate Professor (HDR), LIPN, Université Sorbonne Paris Nord
LLMs · Information Retrieval · Ontology Learning · Geographic IR · Text Mining
Sonia Vanier
LIX, École Polytechnique, IP Paris, France
A. Shabou
DataLab Groupe, Crédit Agricole S.A., France