Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently extracting query-relevant, fine-grained visual information from long videos under constrained computational resources. The authors propose ProVCA, a training-free progressive video condensation agent that employs a multi-granularity iterative mechanism: it first localizes relevant video segments, then selects salient sub-segments, and finally refines key frames for zero-shot reasoning by multimodal large language models (MLLMs). ProVCA is the first method to achieve training-free, progressive multi-granularity condensation, substantially reducing input frame count while preserving critical visual details. Experimental results demonstrate that ProVCA attains zero-shot accuracies of 69.3%, 80.5%, and 77.7% on EgoSchema, NExT-QA, and IntentQA, respectively, outperforming existing training-free approaches.
📝 Abstract
Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
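The three-stage coarse-to-fine condensation described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `condense`, the parameterization, and the use of precomputed per-frame query-relevance scores (e.g., CLIP similarities) are all illustrative assumptions; ProVCA itself drives these stages with an MLLM agent.

```python
# Hypothetical sketch of a progressive segment -> snippet -> keyframe
# condensation, assuming per-frame query-relevance scores are available.

def condense(frame_scores, segment_len=16, snippets_per_segment=2,
             snippet_len=4, keyframes_per_snippet=1):
    """Progressively narrow from coarse segments to a few keyframes.

    frame_scores: list of per-frame relevance scores for the query.
    Returns the sorted indices of the selected keyframes.
    """
    n = len(frame_scores)

    # 1) Segment localization: pick the segment with the highest mean score.
    segments = [list(range(i, min(i + segment_len, n)))
                for i in range(0, n, segment_len)]
    best_seg = max(segments,
                   key=lambda s: sum(frame_scores[i] for i in s) / len(s))

    # 2) Snippet selection: split the segment into snippets, keep the top-k.
    snippets = [best_seg[i:i + snippet_len]
                for i in range(0, len(best_seg), snippet_len)]
    snippets.sort(key=lambda sn: sum(frame_scores[i] for i in sn) / len(sn),
                  reverse=True)
    kept = snippets[:snippets_per_segment]

    # 3) Keyframe refinement: within each kept snippet, keep the top frames.
    keyframes = []
    for sn in kept:
        ranked = sorted(sn, key=lambda i: frame_scores[i], reverse=True)
        keyframes.extend(ranked[:keyframes_per_snippet])
    return sorted(keyframes)
```

For example, with 32 frames whose relevance peaks around frames 21-22, the sketch localizes the second 16-frame segment, selects the snippet covering frames 20-23, and returns the single highest-scoring frame. The point is the shrinking search scope at each stage, which is what lets the MLLM reason over far fewer frames than uniform sampling.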
Problem

Research questions and friction points this paper addresses.

long-form video understanding
video condensation
multimodal large language models
compute efficiency
keyframe selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive video condensation
multimodal large language model
keyframe selection
long-form video understanding
query-guided localization
Yufei Yin
Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University, China
Yuchen Xing
Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University, China
Qianke Meng
Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University, China
Minghao Chen
Hangzhou Dianzi University
Deep Learning, Domain Adaptation, Vision and Language, LLM Agents
Yan Yang
Hangzhou Dianzi University
Machine Learning, Artificial Intelligence, Vision and Language, Medical Image Analysis
Zhou Yu
Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University, China