On the Limits of Token Reduction for Efficient Unified Vision Language Training

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of jointly training unified vision-language models and asks whether token reduction can accelerate that training without hurting either understanding or generation. The study identifies a pronounced asymmetry in how the two task types depend on image tokens: understanding shows substantial visual redundancy in late layers, whereas generation depends on image tokens at every depth. Task-specific accelerators built on this observation achieve significant efficiency gains in isolated settings, but under unified training they force divergent parameter pathways and eliminate the mutual performance gains normally seen in joint optimization. The authors conclude that efficient unified modeling must preserve shared cross-task structures, and they call for synergy-aware acceleration strategies.

📝 Abstract

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.
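
The layerwise attention analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the 0.05 threshold, and the toy attention maps are all hypothetical. The sketch measures how much attention mass each layer places on image tokens, then finds the depth from which image tokens could plausibly be dropped.

```python
from typing import List, Optional, Sequence

# One attention map: rows are query tokens, columns are key tokens.
Matrix = List[List[float]]


def image_attention_share(attn: Matrix, image_idx: Sequence[int]) -> float:
    """Average fraction of each query's attention mass that lands on image tokens."""
    image_set = set(image_idx)
    shares = []
    for row in attn:
        total = sum(row)
        if total == 0:
            continue  # skip fully masked rows
        on_image = sum(w for j, w in enumerate(row) if j in image_set)
        shares.append(on_image / total)
    return sum(shares) / len(shares) if shares else 0.0


def layerwise_profile(attn_per_layer: Sequence[Matrix],
                      image_idx: Sequence[int]) -> List[float]:
    """Image-token attention share for every layer, ordered by depth."""
    return [image_attention_share(a, image_idx) for a in attn_per_layer]


def first_redundant_layer(profile: Sequence[float],
                          threshold: float = 0.05) -> Optional[int]:
    """Earliest layer from which all deeper layers stay below the threshold.

    The threshold is an assumed value for illustration. Returns None when
    the last layer still attends to image tokens above the threshold.
    """
    cut = None
    for layer in range(len(profile) - 1, -1, -1):
        if profile[layer] < threshold:
            cut = layer
        else:
            break
    return cut


def drop_image_tokens(tokens: Sequence[str], image_idx: Sequence[int]) -> List[str]:
    """Remove image tokens from a sequence, keeping the remaining order."""
    image_set = set(image_idx)
    return [t for i, t in enumerate(tokens) if i not in image_set]


# Toy data: 4 tokens per sequence, tokens 0 and 1 are image tokens.
image_idx = [0, 1]
understanding_attn = [
    [[0.4, 0.4, 0.1, 0.1]] * 4,      # early layer: heavy use of image tokens
    [[0.25, 0.25, 0.25, 0.25]] * 4,  # middle layer: balanced
    [[0.01, 0.01, 0.49, 0.49]] * 4,  # late layer: image tokens nearly ignored
]
generation_attn = [[[0.25, 0.25, 0.25, 0.25]] * 4] * 3  # persistent dependence

understanding_cut = first_redundant_layer(layerwise_profile(understanding_attn, image_idx))
generation_cut = first_redundant_layer(layerwise_profile(generation_attn, image_idx))
```

In this toy setup the understanding profile yields a late cut point while the generation profile yields none, which mirrors the asymmetry the abstract reports. Applying `drop_image_tokens` to one task only is the kind of task-specific reduction that, according to the abstract, loses cross-task synergy under unified training.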

Problem

Research questions and friction points this paper is trying to address.

unified vision-language models

token reduction

training efficiency

task synergy

visual understanding and generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

token reduction

unified vision-language models

attention asymmetry