Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the computational inefficiency of video large language models caused by redundant visual tokens, a challenge exacerbated by existing pruning methods that struggle to balance spatiotemporal information retention with compression efficiency. To this end, the authors propose AOT, a novel training-free approach that uniquely integrates intra-frame token anchors—guided by local-global attention—and inter-frame keyframe anchors. By leveraging an optimal transport mechanism, AOT effectively aggregates critical information from pruned tokens while preserving fine-grained yet essential spatiotemporal context. Extensive experiments demonstrate that AOT significantly improves computational efficiency across multiple long- and short-video benchmarks, all while maintaining superior capabilities in modeling visual details and temporal dynamics.

📝 Abstract
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. Moreover, they often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that establishes token **A**nchors within and across frames to comprehensively aggregate informative context via local-global **O**ptimal **T**ransport (**AOT**). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, then use optimal transport to aggregate informative context from pruned tokens into these intra-frame anchors. Next, building on temporal frame clips, the first frame of each clip serves as a keyframe anchor that assembles similar information from consecutive frames through optimal transport, while distinct tokens are kept to represent temporal dynamics, leading to efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, delivering substantial computational efficiency while preserving temporal and visual fidelity. Project webpage: [AOT](https://tyroneli.github.io/AOT).
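The optimal-transport aggregation the abstract describes can be sketched roughly as follows. This is an illustrative approximation only: the Sinkhorn iterations, squared-Euclidean cost, cost normalization, and uniform marginals are our assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, n_iters=100):
    """Entropic-regularized optimal transport via Sinkhorn iterations.
    cost: (n, m) cost matrix; a: (n,) and b: (m,) marginal distributions."""
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # scale columns to match marginal b
        u = a / (K @ v)                  # scale rows to match marginal a
    return u[:, None] * K * v[None, :]   # transport plan, shape (n, m)

def aggregate_into_anchors(anchors, pruned, reg=0.1):
    """Merge pruned-token features into anchor tokens using an OT plan.
    anchors: (m, d) kept anchor tokens; pruned: (n, d) tokens to merge."""
    # Cost = squared Euclidean distance, normalized for numerical stability.
    cost = ((pruned[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    cost = cost / (cost.max() + 1e-9)
    n, m = cost.shape
    plan = sinkhorn(cost, np.full(n, 1.0 / n), np.full(m, 1.0 / m), reg)
    # Each anchor absorbs a plan-weighted average of the pruned tokens.
    weights = plan / plan.sum(axis=0, keepdims=True)   # column-normalize
    return anchors + weights.T @ pruned                # (m, d) updated anchors
```

In AOT's setting, the same transport step would be applied twice: within each frame (pruned tokens into attention-selected anchors) and across frames (consecutive-frame tokens into the clip's keyframe anchors).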
Problem

Research questions and friction points this paper is trying to address.

Video Large Language Models
Token Redundancy
Spatiotemporal Compression
Visual Token Efficiency
Long-context Compressibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Reduction
Optimal Transport
Video Large Language Models
Local-Global Context
Training-Free Compression