VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-form video understanding is hindered by high computational costs and by information dispersed across thousands of frames, which makes naive frame sampling inefficient. This work proposes VideoBrain, a framework built on a dual-agent collaborative mechanism: a CLIP-based semantic retrieval agent and a uniform temporal sampling agent. The agents jointly perform end-to-end adaptive keyframe selection, guided by a behavior-aware reinforcement learning reward. On four long-video benchmarks, the method outperforms the baseline by 3.5%–9.0% on average while reducing the input frame count by 30%–40%, and it generalizes well to short-video tasks.
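The "behavior-aware" reward mentioned above can be pictured as a correctness reward with a penalty for unnecessary agent calls. A minimal sketch, assuming a simple linear penalty (the paper's exact formulation, names, and coefficients are not given here; everything below is an illustrative assumption):

```python
# Hypothetical behavior-aware reward: reward correct answers, but penalize
# agent invocations on questions that did not need extra frames, so the
# model cannot farm reward by calling agents indiscriminately.
def behavior_aware_reward(correct: bool, agent_calls: int,
                          call_needed: bool, penalty: float = 0.2) -> float:
    reward = 1.0 if correct else 0.0
    if not call_needed:
        # Question was answerable from the initial frames: each extra
        # agent call costs `penalty` reward (illustrative coefficient).
        reward -= penalty * agent_calls
    return reward
```

Under this shape, a correct answer with no wasted calls keeps the full reward, while spurious invocations erode it, which is the behavior the data classification pipeline is said to teach.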

📝 Abstract
Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks.
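The dual-agent design in the abstract (semantic retrieval across the whole video plus dense sampling within an interval) can be sketched as follows. This is a toy illustration under stated assumptions: frame embeddings stand in for CLIP features, the function names are invented, and the real system lets the VLM decide when to invoke each agent.

```python
import numpy as np

def clip_agent(frame_embeddings, query_embedding, k=4):
    """Semantic retrieval (CLIP-agent stand-in): top-k frames by
    cosine similarity to the query, assuming unit-norm embeddings."""
    sims = frame_embeddings @ query_embedding
    return sorted(np.argsort(sims)[-k:].tolist())

def uniform_agent(start, end, k=4):
    """Dense temporal sampling: k evenly spaced frames in [start, end)."""
    return np.linspace(start, end - 1, k).round().astype(int).tolist()

def select_frames(frame_embeddings, query_embedding, interval, k=4):
    """Union of both agents' picks, deduplicated and time-ordered."""
    picks = set(clip_agent(frame_embeddings, query_embedding, k))
    picks.update(uniform_agent(*interval, k))
    return sorted(picks)

# Toy run: 100 random unit-norm "frame embeddings"; the query is aligned
# with frame 42, and the uniform agent densifies the interval [60, 80).
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
frames = select_frames(emb, emb[42], interval=(60, 80), k=4)
```

The point of the union is complementarity: the retrieval agent recovers query-relevant frames anywhere in the video, while the uniform agent guards against missing context inside an interval the retrieval step already flagged.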
Problem

Research questions and friction points this paper is trying to address.

long video understanding
frame sampling
vision-language models
information loss
computational constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive frame sampling
vision-language models
agent-based video understanding
behavior-aware reward
long-form video understanding
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30