On the Limits of Token Reduction for Efficient Unified Vision Language Training

📅 2026-05-31
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work examines the high computational cost of jointly training unified vision-language models and asks how far token-reduction methods can accelerate it. A layerwise attention analysis reveals a pronounced asymmetry: visual understanding shows substantial image-token redundancy in late layers, whereas visual generation depends on image tokens throughout the network's depth. Guided by this observation, the authors design task-specific accelerators that reduce image-token computation for each objective. These accelerators yield significant efficiency gains when each task is handled in isolation, but under unified training they cause a consistent synergy loss: task-specific token dropping forces divergent parameter pathways and removes the mutual performance gains normally seen in joint optimization. The authors conclude that efficient unified modeling must preserve shared cross-task structures, and they point to the need for synergy-aware acceleration strategies.
📝 Abstract
Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.
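To make the layerwise analysis more concrete, the following is a minimal sketch of how attention allocation to image tokens could be measured per layer and used to pick a depth from which image tokens might be dropped. It is not the authors' implementation: the function names, the toy attention matrices, and the 0.05 threshold are all assumptions chosen for illustration, and the abstract does not specify the metric or the pruning rule actually used.

```python
from typing import List, Optional, Sequence

# attention[query][key]; each row is assumed to sum to 1 (post-softmax).
Matrix = List[List[float]]


def image_attention_share(attn: Matrix, image_positions: Sequence[int]) -> float:
    """Mean fraction of attention mass that non-image queries place on image keys."""
    image_set = set(image_positions)
    query_rows = [row for q, row in enumerate(attn) if q not in image_set]
    if not query_rows:
        return 0.0
    shares = [sum(row[k] for k in image_set) for row in query_rows]
    return sum(shares) / len(shares)


def first_redundant_layer(
    layer_attn: Sequence[Matrix],
    image_positions: Sequence[int],
    threshold: float = 0.05,  # hypothetical cutoff, not taken from the paper
) -> Optional[int]:
    """Earliest layer from which every later layer stays below the threshold."""
    shares = [image_attention_share(a, image_positions) for a in layer_attn]
    cut = None
    for i in range(len(shares) - 1, -1, -1):
        if shares[i] < threshold:
            cut = i
        else:
            break
    return cut


def drop_image_tokens(hidden: List[List[float]], image_positions: Sequence[int]):
    """Remove image-token rows from a sequence of hidden states."""
    image_set = set(image_positions)
    return [h for i, h in enumerate(hidden) if i not in image_set]


# Toy sequence: positions 0-1 are image tokens, positions 2-3 are text tokens.
IMAGE_POS = [0, 1]

# Understanding-like pattern: text queries stop attending to the image at layer 1.
understanding_layers = [
    [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
     [0.4, 0.3, 0.2, 0.1], [0.3, 0.3, 0.2, 0.2]],
    [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
     [0.01, 0.01, 0.58, 0.40], [0.02, 0.0, 0.48, 0.50]],
]

# Generation-like pattern: image tokens stay important at every layer.
generation_layers = [
    [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
     [0.4, 0.3, 0.2, 0.1], [0.3, 0.3, 0.2, 0.2]],
    [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
     [0.3, 0.3, 0.2, 0.2], [0.4, 0.2, 0.2, 0.2]],
]

understanding_cut = first_redundant_layer(understanding_layers, IMAGE_POS)
generation_cut = first_redundant_layer(generation_layers, IMAGE_POS)

hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
pruned_states = drop_image_tokens(hidden_states, IMAGE_POS)
```

In this toy setup the understanding-like pattern yields a cut layer while the generation-like pattern yields none, which mirrors the asymmetry the abstract describes. Applying different cut points per task is exactly the kind of task-specific dropping that the abstract reports as harmful to synergy under unified training, so the sketch illustrates the analysis and not a recommended training recipe.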
Problem

Research questions and friction points this paper is trying to address.

unified vision-language models
token reduction
training efficiency
task synergy
visual understanding and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

token reduction
unified vision-language models
attention asymmetry
synergy-aware acceleration
efficient training