Dynamic Token Reduction during Generation for Vision Language Models

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models suffer from high decoder attention computation overhead and severe visual token redundancy in image captioning, leading to slow inference. Existing approaches (e.g., FASTV, VTW) perform only a single, static token pruning step and thus cannot adapt dynamically during autoregressive generation. This paper proposes DyRate, a dynamic progressive token compression strategy. DyRate introduces a lightweight rate prediction network that models the evolution of attention distributions across the entire generation trajectory, enabling step-wise adaptive visual token pruning and compression. By aligning compression with the changing attention dynamics, DyRate significantly reduces computational cost while preserving caption quality. Experimental results demonstrate that DyRate outperforms state-of-the-art single-step pruning methods in both efficiency and generation fidelity, achieving superior overall performance.

📝 Abstract
Vision-Language Models (VLMs) have achieved notable success in multimodal tasks but face practical limitations due to the quadratic complexity of decoder attention mechanisms and autoregressive generation. Existing methods like FASTV and VTW have achieved notable results in reducing redundant visual tokens, but these approaches prune tokens in a single forward pass without systematically analyzing the redundancy of visual tokens throughout the entire generation process. In this paper, we introduce a dynamic pruning strategy tailored for VLMs, named Dynamic Rate (DyRate), which progressively adjusts the compression rate during generation. Our analysis of attention distributions reveals that the importance of visual tokens decreases throughout the generation process, motivating progressively more aggressive compression rates. By integrating a lightweight predictor based on the attention distribution, our approach enables flexible adjustment of pruning rates as generation proceeds. Our experimental results demonstrate that our method not only reduces computational demands but also maintains the quality of responses.
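The core loop described above can be sketched in a few lines. The snippet below is a minimal toy illustration, not the paper's implementation: `predicted_keep_rate` stands in for DyRate's learned rate-prediction network (here a simple clamp of the visual attention share, which is an assumption), and the decaying `share` value is a toy proxy for the paper's observation that attention to visual tokens declines over generation steps.

```python
import random

def predicted_keep_rate(attn_share, lo=0.2, hi=1.0):
    # Hypothetical stand-in for DyRate's learned predictor:
    # a lower attention share on visual tokens yields a lower keep rate
    # (i.e., more aggressive pruning). The real predictor is trained.
    return max(lo, min(hi, attn_share))

def prune_step(tokens, attn, keep_rate):
    # Keep the top-k visual tokens ranked by attention weight,
    # preserving their original spatial order.
    k = max(1, round(len(tokens) * keep_rate))
    keep = sorted(sorted(range(len(tokens)), key=lambda i: attn[i])[-k:])
    return [tokens[i] for i in keep], [attn[i] for i in keep]

random.seed(0)
tokens = list(range(576))          # e.g. a 24x24 grid of visual patches
history = []
for step in range(5):
    attn = [random.random() for _ in tokens]   # simulated attention weights
    share = 1.0 / (1 + step)       # toy proxy for decaying visual attention
    tokens, attn = prune_step(tokens, attn, predicted_keep_rate(share))
    history.append(len(tokens))
print(history)  # the retained token count shrinks step by step
```

Because the keep rate is recomputed at every decoding step, the visual token set shrinks progressively rather than being cut once up front, which is the distinction the paper draws against single-pass methods such as FASTV and VTW.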
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Image Captioning
Efficiency Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

DyRate
Dynamic Vocabulary Adjustment
Intelligent Detail Reduction