🤖 AI Summary
This work addresses the high computational and memory costs incurred by vision-language models when processing long visual token sequences. Existing pruning methods, which rely on local heuristics, often suffer from positional bias and fragmented information retention, making it difficult to preserve critical semantics under high compression ratios. To overcome these limitations, the authors propose a training-free, plug-and-play pruning approach that introduces singular value decomposition (SVD) into visual token selection for the first time. By leveraging statistical leverage scores, the method identifies the top-K tokens that contribute most significantly to the global principal components. This strategy sidesteps the shortcomings of local heuristics and achieves substantial gains over existing techniques, even under extreme compression settings retaining only 16 or 32 tokens, while remaining robust on detail-rich images.
📝 Abstract
Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.