🤖 AI Summary
This work addresses the challenge of efficiently extracting query-relevant, fine-grained visual information from long videos under constrained computational resources. The authors propose ProVCA, a training-free progressive video condensation agent that employs a multi-granularity iterative mechanism: it first localizes relevant video segments, then selects salient sub-segments, and finally refines keyframes for zero-shot reasoning by multimodal large language models (MLLMs). ProVCA is the first method to achieve training-free, progressive multi-granularity condensation, substantially reducing the input frame count while preserving critical visual details. Experimental results demonstrate that ProVCA attains zero-shot accuracies of 69.3%, 80.5%, and 77.7% on EgoSchema, NExT-QA, and IntentQA, respectively, outperforming existing training-free approaches.
📝 Abstract
Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
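The coarse-to-fine narrowing described above (segment localization → snippet selection → keyframe refinement) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the `similarity` scores stand in for query–frame relevance that would, in practice, come from a vision-language encoder, and all chunk sizes and selection counts are hypothetical parameters.

```python
def chunk(indices, size):
    """Split a list of frame indices into consecutive fixed-size chunks."""
    return [indices[i:i + size] for i in range(0, len(indices), size)]


def condense(similarity, seg_size=16, snip_size=4, top_snippets=2, top_frames=1):
    """Progressively narrow a video from segments to snippets to keyframes.

    `similarity[f]` is a stand-in relevance score of frame f to the query.
    """
    frames = list(range(len(similarity)))

    # Stage 1: segment localization -- keep the segment whose frames are,
    # on average, most relevant to the query.
    segments = chunk(frames, seg_size)
    best_seg = max(segments, key=lambda s: sum(similarity[f] for f in s) / len(s))

    # Stage 2: snippet selection -- rank snippets inside that segment by
    # mean relevance and keep the top few.
    snippets = chunk(best_seg, snip_size)
    snippets.sort(key=lambda s: sum(similarity[f] for f in s) / len(s), reverse=True)
    kept = snippets[:top_snippets]

    # Stage 3: keyframe refinement -- pick the highest-scoring frame(s)
    # within each kept snippet.
    keyframes = []
    for snip in kept:
        keyframes.extend(
            sorted(snip, key=lambda f: similarity[f], reverse=True)[:top_frames]
        )
    return sorted(keyframes)


# Toy relevance scores for a 32-frame "video": frames around 20-23 matter most.
scores = [0.1] * 32
for f, s in [(20, 0.9), (21, 0.8), (22, 0.7), (17, 0.6)]:
    scores[f] = s
print(condense(scores))  # → [17, 20]
```

Only the two selected keyframes (out of 32) would then be passed to the MLLM, which is the point of the design: the frame budget shrinks at every stage while the query-relevant content survives.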