🤖 AI Summary
Video large language models (VLLMs) incur high inference overhead because many of their visual tokens are redundant, and existing pruning methods neglect video dynamics and temporal dependencies. To address this, we propose MMG-Vid, a training-free framework that prunes at both the segment level and the token level: first, the video is divided into segments based on inter-frame similarity; second, the token budget is dynamically allocated across segments; third, a temporal-guided density peak clustering (DPC) algorithm selects tokens by jointly modeling inter-frame uniqueness and intra-frame diversity. Evaluated on LLaVA-OneVision-7B, our method reduces visual tokens by 75%, achieves a 3.9× speedup in the prefill stage, and retains over 99.5% of the original performance. This work marks the first effort to enable efficient visual token compression explicitly guided by modeling video temporal structure.
📝 Abstract
Video Large Language Models (VLLMs) excel at video understanding, but their large number of visual tokens poses a significant computational challenge for real-world applications. Current methods aim to improve inference efficiency through visual token pruning. However, because they treat video understanding as a generic multi-frame task, they do not account for the dynamic characteristics and temporal dependencies of video frames. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both the segment level and the token level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget across segments to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided Density Peak Clustering (DPC) algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid makes full use of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid maintains over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9× on LLaVA-OneVision-7B. Code will be released soon.
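The two stages described above can be illustrated with a minimal sketch. It is not the MMG-Vid implementation, which has not been released. All function names, the similarity threshold of 0.8, the cutoff distance `dc`, and the per-segment `weights` are assumptions made for illustration. The token selection step uses plain density peak clustering and omits the temporal guidance described in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_segments(frame_feats, threshold=0.8):
    """Group consecutive frame indices into segments.

    A new segment starts whenever the similarity between adjacent frames
    drops below `threshold` (an assumed value, not taken from the paper).
    """
    segments = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def allocate_budget(weights, total_budget):
    """Split `total_budget` tokens across segments in proportion to `weights`.

    `weights` is a hypothetical per-segment importance score. Rounding uses
    the largest-remainder rule so the budgets sum exactly to `total_budget`.
    """
    total = sum(weights)
    shares = [total_budget * w / total for w in weights]
    budgets = [int(s) for s in shares]
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


def dpc_select(tokens, k, dc=1.0):
    """Return the sorted indices of the k tokens with the highest DPC score.

    rho is a Gaussian-kernel local density; delta is the distance to the
    nearest token of higher density (or the largest distance if none exists).
    Tokens with high rho * delta act as cluster centers.
    """
    n = len(tokens)
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]
    rho = [
        sum(math.exp(-((dist[i][j] / dc) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return sorted(ranked[:k])
```

In this sketch a caller would run `split_segments` on per-frame features, pass one weight per segment to `allocate_budget` (segment length is one simple placeholder), and then call `dpc_select` on each segment's tokens with that segment's budget as `k`.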