FrameOracle: Learning What to See and How Much to See in Videos

📅 2025-10-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models (VLMs) are constrained by fixed-frame inputs, and conventional uniform or budget-driven frame sampling fails to adapt to varying video information density and task complexity, leading either to redundancy or to the loss of critical frames. To address this, the paper proposes FrameOracle, a lightweight, plug-and-play module that jointly predicts both how many keyframes are needed and which frames to select. It also introduces FrameOracle-41K, a large-scale VideoQA dataset with annotated minimal keyframe sets. Training follows a four-stage curriculum that combines cross-modal similarity as a weak proxy signal with ground-truth keyframe supervision. Evaluation across five VLMs and six benchmarks shows: (1) 16-frame inputs compressed to an average of 10.4 frames with no accuracy loss; and (2) 64-frame candidates reduced to an average of 13.9 frames while improving accuracy by 1.4%, a state-of-the-art efficiency–accuracy trade-off.

📝 Abstract
Vision-language models (VLMs) have advanced video understanding, but their performance is limited by the number of input frames they can process. Existing frame sampling strategies, such as uniform or fixed-budget selection, often fail to adapt to variations in information density or task complexity, resulting in inefficiency and information loss. To address this, we present FrameOracle, a lightweight and plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained using a four-stage curriculum, with the first three stages relying on weak proxy signals such as cross-modal similarity. In the final stage, it leverages stronger supervision from a new dataset we introduce, FrameOracle-41K, the first large-scale VideoQA collection to provide keyframe annotations specifying the minimal set of frames required to answer each question. Extensive experiments across five VLMs and six benchmarks demonstrate that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without any loss in accuracy. When starting from 64-frame candidates, it reduces the input to an average of 13.9 frames while improving accuracy by 1.4%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.
Problem

Research questions and friction points this paper is trying to address.

Making frame selection efficient for video understanding
Adapting frame sampling to varying information density and task complexity
Determining the minimal set of frames needed to answer a video question accurately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns which frames are relevant to queries
Predicts how many frames are needed for tasks
Trains via a four-stage curriculum that moves from weak proxy signals to strong keyframe supervision
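The adaptive selection idea can be sketched as follows: score each frame's relevance to the query with a cross-modal similarity, then keep only the smallest set of top-ranked frames that covers most of the relevance mass. This is an illustrative sketch, not the paper's implementation; the cosine scoring, softmax normalization, and the `mass` cutoff are assumptions chosen for clarity.

```python
import math

def select_frames(frame_feats, query_feat, max_frames=16, mass=0.9):
    """Pick a query-dependent number of frames (illustrative sketch).

    frame_feats: list of per-frame feature vectors
    query_feat:  feature vector for the text query
    mass:        fraction of total relevance the kept frames must cover
    """
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0

    # Cross-modal similarity score for every frame.
    scores = [cos(f, query_feat) for f in frame_feats]

    # Softmax-normalize scores into a distribution over frames.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Greedily take top-ranked frames until `mass` of the relevance
    # distribution is covered (or the frame budget is exhausted).
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass or len(chosen) >= max_frames:
            break
    return sorted(chosen)
```

Because the cutoff depends on how concentrated the relevance distribution is, a video whose answer hinges on one or two frames yields a short selection, while a diffuse query keeps more frames, mirroring the paper's goal of adapting frame count to information density.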