Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

📅 2024-11-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot video understanding faces a fundamental trade-off between computational efficiency and fine-grained detail preservation: existing approaches either rely heavily on large-scale annotated data and substantial compute resources, or exhibit limited robustness on complex, long-duration videos. To address this, we propose a fine-tuning-free, efficient framework centered on hierarchical keyframe sampling and a bipartite graph clustering–based dynamic visual token merging mechanism, enabling semantic-aware multi-granularity spatiotemporal feature compression. Our method adaptively retains critical scene-level and object-level details while drastically reducing computational overhead. Extensive evaluation across multiple zero-shot video understanding benchmarks—including Ego4D, Something-Something V2, and Kinetics—demonstrates consistent and significant improvements over both state-of-the-art supervised methods and existing training-free approaches, establishing new SOTA performance.

📝 Abstract
Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates hierarchical frame selection with a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods and setting a new state of the art for zero-shot video understanding.
Problem

Research questions and friction points this paper is trying to address.

Achieving high fidelity in zero-shot video tasks
Balancing computational efficiency with semantic richness
Preserving context-rich features in complex video content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token merging for zero-shot video
Hierarchical frame selection strategy
Bipartite token merging for efficiency
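The bipartite token merging listed above can be sketched roughly as follows. This is a minimal illustration in the spirit of bipartite soft matching (as in ToMe-style merging), not DYTO's actual implementation; the function name, the alternating split into two sets, and the count-weighted averaging rule are all assumptions for illustration.

```python
import numpy as np

def bipartite_token_merge(tokens, r):
    """Merge the r most similar token pairs via bipartite soft matching.

    tokens: (N, d) array of visual token features; r: number of merges.
    Returns an array of N - r tokens. A sketch only; DYTO's exact
    clustering and merging rules may differ.
    """
    # Alternately split tokens into two disjoint sets A and B.
    a, b = tokens[0::2], tokens[1::2]
    # Cosine similarity between every A token and every B token.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                      # shape (|A|, |B|)
    best_b = sim.argmax(axis=1)            # each A token's best match in B
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]  # the r most redundant A tokens
    keep_idx = np.argsort(-best_sim)[r:]   # A tokens kept unmerged
    merged_b = b.copy()
    counts = np.ones(len(b))
    # Fold each merged A token into its best B match by running average.
    for i in merge_idx:
        j = best_b[i]
        merged_b[j] = (merged_b[j] * counts[j] + a[i]) / (counts[j] + 1)
        counts[j] += 1
    return np.concatenate([a[keep_idx], merged_b], axis=0)
```

Applied per layer with an adaptive (rather than fixed) r, a schedule like this compresses the token sequence while keeping the most distinctive scene- and object-level tokens intact.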
Yiming Zhang
HFIPS, Chinese Academy of Sciences; University of Science and Technology of China
Zhuokai Zhao
Research Scientist, Meta AI
LLM Agents, Multimodal LLM Reasoning, Data-Efficient Learning
Zhaorun Chen
Ph.D. Student, UChicago CS
AI Safety, LLM Agents, Reinforcement Learning
Zenghui Ding
HFIPS, Chinese Academy of Sciences
Xianjun Yang
HFIPS, Chinese Academy of Sciences; University of Science and Technology of China
Yining Sun
Johns Hopkins University
Computer Vision