MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video large language models (VLLMs) suffer from high inference overhead due to visual token redundancy, and existing pruning methods neglect video dynamics and temporal dependencies. To address this, the paper proposes a training-free, segment-token joint pruning framework: first, the video is partitioned into semantic segments based on inter-frame similarity; second, the token budget is dynamically allocated across segments to maximize each segment's marginal gain; third, a temporal-guided density peak clustering (DPC) algorithm that jointly models inter-frame uniqueness and intra-frame diversity selects which tokens to keep. Evaluated on LLaVA-OneVision-7B, the method reduces visual tokens by 75%, achieves a 3.9× speedup in the prefill stage, and retains over 99.5% of the original performance. Token compression is thus guided explicitly by the video's temporal structure.
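
As a rough illustration of the first step, here is a minimal sketch of similarity-based segment partitioning. The cosine-similarity cut, the pooled frame features, and the 0.85 threshold are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch: start a new segment whenever consecutive-frame cosine
# similarity drops below a threshold. Feature pooling and the threshold
# value are illustrative assumptions, not the paper's exact settings.
import torch
import torch.nn.functional as F

def split_into_segments(frame_feats: torch.Tensor, sim_threshold: float = 0.85):
    """frame_feats: (num_frames, dim) pooled per-frame features.
    Returns a list of (start, end) frame index ranges."""
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)
    segments, start = [], 0
    for i, s in enumerate(sims.tolist()):
        if s < sim_threshold:          # content shift: close current segment
            segments.append((start, i + 1))
            start = i + 1
    segments.append((start, frame_feats.shape[0]))
    return segments
```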

📝 Abstract
Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to improve inference efficiency through visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, since they treat video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both the segment level and the token level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid maximizes the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid maintains over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
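
The temporal-guided DPC stage builds on classical density peak clustering, which scores each token by its local density and its distance to the nearest denser token. The sketch below shows only this classical scoring; the paper's temporal guidance (the inter-frame uniqueness and intra-frame diversity weighting) is not reproduced here, and the function name, cutoff `d_c`, and budget handling are illustrative assumptions.

```python
# Minimal sketch of density-peak-style token selection, assuming token
# features of shape (num_tokens, dim). The paper's temporal-guided
# variant adds weighting terms that are not modeled here.
import torch

def dpc_select(tokens: torch.Tensor, budget: int, d_c: float = 0.5):
    """Keep `budget` tokens with the highest density-peak score rho * delta."""
    dist = torch.cdist(tokens, tokens)            # pairwise distances
    rho = (dist < d_c).float().sum(dim=1)         # local density
    n = tokens.shape[0]
    delta = torch.full((n,), float("inf"))
    for i in range(n):
        higher = rho > rho[i]                     # strictly denser tokens
        if higher.any():
            delta[i] = dist[i, higher].min()      # nearest denser token
        else:
            delta[i] = dist[i].max()              # global density peak
    score = rho * delta
    keep = score.topk(min(budget, n)).indices
    return tokens[keep], keep
```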
Problem

Research questions and friction points this paper is trying to address.

Reducing the computational overhead of visual tokens in VLLMs
Accounting for the dynamic characteristics and temporal dependencies of video frames
Maximizing the utility of a limited token budget while maintaining performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-level token budget allocation driven by marginal gains (see the allocation sketch after this list)
Temporal-guided DPC algorithm for token selection
Training-free visual token pruning framework
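
The segment-level allocation can be read as a greedy marginal-gain procedure: each extra token goes to the segment whose next token currently buys the most. The sketch below assumes a diminishing-returns (square-root) gain curve as a stand-in for the paper's segment-importance measure, so `seg_weights` and the gain function are hypothetical.

```python
# Illustrative greedy allocator: repeatedly give one token to the segment
# with the highest current marginal gain. The sqrt gain curve is an
# assumption standing in for the paper's actual segment-gain formulation.
import heapq
import math

def allocate_budget(seg_weights: list[float], total_budget: int) -> list[int]:
    """seg_weights: per-segment importance scores; returns tokens per segment."""
    alloc = [0] * len(seg_weights)
    # max-heap keyed on negated marginal gain; first token's gain equals w
    heap = [(-w, i) for i, w in enumerate(seg_weights)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        # diminishing returns: gain of token k+1 shrinks as sqrt(k+1) - sqrt(k)
        nxt = seg_weights[i] * (math.sqrt(alloc[i] + 1) - math.sqrt(alloc[i]))
        heapq.heappush(heap, (-nxt, i))
    return alloc
```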
👥 Authors
Junpeng Ma (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Fudan University)
Qizhe Zhang (School of Computer Science, Peking University)
Ming Lu (State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University)
Zhibin Wang (Zhejiang University)
Qiang Zhou (Taobao & Tmall Group of Alibaba)
Jun Song (Shenzhen University)
Shanghang Zhang (Peking University)