🤖 AI Summary
Large vision-language models (LVLMs) suffer from high inference overhead due to dense visual token representations.
Method: This paper proposes a training-free, two-stage token compression framework. The first stage prunes redundant low-level visual features using intra-modal visual self-attention; the second stage filters task-irrelevant tokens using cross-modal vision–text attention. Together, the two stages realize what the paper presents as the first staged dynamic pruning guided by the model's global information flow.
Contribution/Results: The method departs from conventional single-stage, local pruning paradigms. It is architecture-agnostic and plug-and-play, achieving up to 2.5× inference speedup on mainstream benchmarks while matching, and in some cases exceeding, the original model's accuracy, even under high pruning ratios.
📝 Abstract
Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or on visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved, performance.
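To make the two-stage idea concrete, here is a minimal NumPy sketch of staged attention-guided token reduction. This is an illustrative reconstruction, not the authors' implementation: the function names, the mean-attention scoring rule, and the keep ratios are all assumptions; the actual STAR method may score and schedule pruning differently across layers.

```python
import numpy as np

def prune_by_self_attention(tokens, vis_attn, keep_ratio):
    """Stage 1 (sketch): score each visual token by the average attention
    it receives from other visual tokens, then keep the top fraction.
    tokens:   (N, d) visual token embeddings
    vis_attn: (N, N) visual self-attention weights (row i attends to col j)
    """
    scores = vis_attn.mean(axis=0)              # attention received per token
    k = max(1, int(len(tokens) * keep_ratio))   # number of tokens to keep
    keep = np.sort(np.argsort(scores)[-k:])     # top-k, original order kept
    return tokens[keep]

def prune_by_cross_attention(tokens, text_to_vis_attn, keep_ratio):
    """Stage 2 (sketch): score the surviving visual tokens by the average
    attention they receive from text tokens, keeping the most task-relevant.
    text_to_vis_attn: (T, N) attention from T text tokens to N visual tokens
    """
    scores = text_to_vis_attn.mean(axis=0)      # relevance to the prompt
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep]

# Toy usage: 10 visual tokens, 3 text tokens, 50% then 40% kept.
rng = np.random.default_rng(0)
vis_tokens = rng.normal(size=(10, 4))
stage1 = prune_by_self_attention(vis_tokens, rng.random((10, 10)), 0.5)
stage2 = prune_by_cross_attention(stage1, rng.random((3, len(stage1))), 0.4)
print(stage1.shape, stage2.shape)  # (5, 4) (2, 4)
```

The key design point the sketch captures is that the two stages use different signals: the first needs no text at all (redundancy among visual features), while the second conditions on the prompt (task relevance), so the surviving token set depends on both the image and the query.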