Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

πŸ“… 2025-05-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing visual token pruning methods employ static strategies, overlooking the dynamic trade-off between prompt alignment and visual fidelity across diverse tasks, which leads to unstable performance. This work establishes, for the first time, a closed-form error bound based on the Hausdorff distance, formally characterizing the intrinsic tension between these two objectives. We propose Multi-Objective Balanced Covering (MoB), the first provably bounded, linearly scalable bi-objective pruning framework. MoB formulates pruning as a budget-constrained covering optimization problem and enables adaptive objective balancing via greedy radius trading. On LLaVA-1.5-7B, MoB retains only 11.1% of visual tokens while preserving 96.4% of downstream performance, achieving a 1.3–1.5× inference speedup. Moreover, MoB is architecture-agnostic and compatible with mainstream multimodal large language models (MLLMs), including Qwen2-VL and Video-LLaVA.

πŸ“ Abstract
Existing visual token pruning methods target prompt alignment and visual preservation with static strategies, overlooking the varying relative importance of these objectives across tasks, which leads to inconsistent performance. To address this, we derive the first closed-form error bound for visual token pruning based on the Hausdorff distance, uniformly characterizing the contributions of both objectives. Moreover, leveraging $\epsilon$-covering theory, we reveal an intrinsic trade-off between these objectives and quantify their optimal attainment levels under a fixed budget. To practically handle this trade-off, we propose Multi-Objective Balanced Covering (MoB), which reformulates visual token pruning as a bi-objective covering problem. In this framework, the attainment trade-off reduces to budget allocation via greedy radius trading. MoB offers a provable performance bound and linear scalability with respect to the number of input visual tokens, enabling adaptation to challenging pruning scenarios. Extensive experiments show that MoB preserves 96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual tokens and accelerates LLaVA-Next-7B by 1.3–1.5$\times$ with negligible performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm that MoB integrates seamlessly into advanced MLLMs and diverse vision-language tasks.
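The abstract's error bound is stated in terms of the Hausdorff distance between the full visual token set and its pruned subset. As a minimal illustration of that quantity (the standard directed Hausdorff distance, i.e. the covering radius of the kept tokens over the originals, not the paper's own derivation), one could compute:

```python
import numpy as np

def directed_hausdorff(A, B):
    """Directed Hausdorff distance: the largest distance from any point
    of A to its nearest point in B (the covering radius of B over A)."""
    # Pairwise Euclidean distances between rows of A and rows of B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return float(d.min(axis=1).max())

# Toy "tokens" in 2-D and a pruned kept subset (illustrative values).
tokens = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
kept = np.array([[0.0, 0.0]])
print(directed_hausdorff(tokens, kept))  # 2.0: token (2,0) is farthest from the kept set
```

A small covering radius means every discarded token has a nearby retained surrogate, which is the geometric intuition behind bounding pruning error by this distance.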
Problem

Research questions and friction points this paper is trying to address.

Static strategies overlook varying importance of objectives in visual token pruning
Lack of uniform error bound for visual token pruning performance
Intrinsic trade-off between prompt alignment and visual preservation not quantified
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-form error bound using Hausdorff distance
Bi-objective covering problem via MoB
Greedy radius trading for budget allocation
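The paper's exact greedy radius trading procedure is not given here, but the covering formulation it builds on can be sketched with the classic farthest-point (greedy k-center) heuristic, which greedily shrinks the covering radius under a token budget. This is a generic 2-approximation for the covering problem, offered only as a hypothetical stand-in for MoB's actual algorithm:

```python
import numpy as np

def greedy_cover(tokens, budget):
    """Farthest-point selection: repeatedly keep the token farthest from
    the current kept set, shrinking the covering radius each step.
    Runs in O(n * budget * d): linear in the number of tokens n."""
    kept = [0]  # seed with the first token (any seed works)
    # dist[i] = distance from token i to its nearest kept token so far
    dist = np.linalg.norm(tokens - tokens[0], axis=1)
    while len(kept) < budget:
        i = int(dist.argmax())  # farthest (worst-covered) token
        kept.append(i)
        dist = np.minimum(dist, np.linalg.norm(tokens - tokens[i], axis=1))
    return kept, float(dist.max())  # kept indices and final covering radius

# Hypothetical sizes: 576 patch tokens (24x24) in 64-d, pruned to ~11%.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64))
kept, radius = greedy_cover(tokens, budget=64)
print(len(kept), round(radius, 3))
```

The linear dependence on n for a fixed budget mirrors the linear scalability claimed for MoB, though MoB additionally balances a second, prompt-alignment objective that this single-objective sketch omits.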
πŸ”Ž Similar Papers
No similar papers found.
Yangfu Li
East China Normal University

Hongjian Zhan
Shanghai Key Laboratory of Multidimensional Information Processing, School of Communications and Electronic Engineering, East China Normal University

Tianyi Chen
School of Mathematical Sciences, Shanghai Jiao Tong University

Qi Liu
Shanghai Key Laboratory of Multidimensional Information Processing, School of Communications and Electronic Engineering, East China Normal University

Yue Lu
Shanghai Key Laboratory of Multidimensional Information Processing, School of Communications and Electronic Engineering, East China Normal University