Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost that vision-language models incur when processing very long visual token sequences. Existing token reduction methods ease this cost, but they tend to keep only the isolated subject matching the user's query, and so fail to capture other salient subjects and the contextual relationships among them. The authors propose SPpruner, a token compression framework modeled on the human visual perception mechanism of “focusing first, then contextualizing.” SPpruner follows a subject-centric, progressive pruning paradigm: it first identifies critical subjects with a module that combines visual saliency and semantic relevance, then aggregates context from neighboring regions through structured scanning to restore global dependencies. On Qwen2.5-VL, SPpruner achieves up to a 2.53× speedup while retaining only 22.2% of visual tokens; on LLaVA, it reduces FLOPs by 67% with a 0.6% accuracy drop, and the authors report that it consistently outperforms state-of-the-art methods.

📝 Abstract

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational cost arising from massive visual token sequences during inference. Existing visual token reduction methods alleviate this burden, but they tend to preserve only the isolated visual subject strictly aligned with the user's query, and thus fail to adequately explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the Focus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module that explicitly models the interplay between visual saliency and semantic relevance. This module uncovers the full spectrum of visual subjects, ensuring a high-fidelity representation of the visual input. Subsequently, a context-aware structural scanning module aggregates contextual cues from neighboring regions, restoring global relational dependencies and preserving the structural integrity of the retained subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to a 2.53× speedup with only 22.2% of visual tokens retained on Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.
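
The two-stage idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `focus_then_context`, the linear blend weight `alpha`, the `focus_share` split of the budget, and the 4-neighbour expansion are all assumptions chosen only to make the "focus first, then add neighboring context" pattern concrete. The saliency and relevance maps are taken as given inputs, since the abstract does not specify how they are computed.

```python
import numpy as np

def focus_then_context(saliency, relevance, keep_ratio=0.222,
                       focus_share=0.5, alpha=0.5):
    """Return sorted flat indices of visual tokens to keep on an H x W grid.

    Illustrative only: all parameters here are hypothetical, not from the paper.
    saliency, relevance: arrays of shape (H, W) holding per-token scores.
    """
    h, w = saliency.shape
    budget = max(1, int(round(keep_ratio * h * w)))    # total tokens to keep
    n_focus = max(1, int(round(focus_share * budget)))  # tokens picked as "focus"

    def minmax(x):
        # Rescale a score map to [0, 1]; a constant map becomes all zeros.
        x = x.astype(float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    # Stage 1 (focus): blend the two cues and take the highest-scoring tokens.
    score = (alpha * minmax(saliency) + (1 - alpha) * minmax(relevance)).ravel()
    order = np.argsort(-score, kind="stable")
    focus = order[:n_focus]
    kept = set(focus.tolist())

    # Stage 2 (context): add the 4-neighbours of each focus token,
    # visiting focus tokens from highest to lowest score, until the budget is met.
    for idx in focus:
        r, c = divmod(int(idx), w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and len(kept) < budget:
                kept.add(rr * w + cc)

    # Spend any leftover budget on the next-best scoring tokens.
    for idx in order:
        if len(kept) >= budget:
            break
        kept.add(int(idx))
    return np.array(sorted(kept))

# Usage: a 6 x 6 token grid with one salient, query-relevant token at (2, 2).
sal = np.zeros((6, 6)); sal[2, 2] = 1.0
rel = np.zeros((6, 6)); rel[2, 2] = 1.0
kept = focus_then_context(sal, rel)
print(len(kept), "of", sal.size, "tokens kept")
```

With the default `keep_ratio` of 0.222 (the retention rate quoted for Qwen2.5-VL), the 36-token grid is reduced to 8 tokens, and the kept set includes the peak token together with its four grid neighbours.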

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

visual token reduction

computational cost

contextual relationships

visual saliency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Focus-then-Context

subject-centric

visual token reduction