EEA: Exploration-Exploitation Agent for Long Video Understanding

📅 2025-12-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Long-video understanding faces the dual challenges of information sparsity and prohibitive computational cost: dense processing is inefficient, while coarse-grained strategies often miss critical frames. To address this, we propose a semantic-guided hierarchical tree search framework that dynamically balances exploration (covering unseen segments) and exploitation (focusing on salient semantic cues) via an adaptive mechanism integrating intrinsic rewards and semantic priors. Our method jointly leverages a hierarchical tree structure, vision-language model (VLM)-driven intrinsic rewards, semantic anchor matching, and uncertainty-aware pruning, where semantic queries are updated online and epistemic uncertainty is explicitly modeled. Evaluated on multiple long-video understanding benchmarks, our approach achieves state-of-the-art performance at significantly reduced computational cost, demonstrating superior accuracy, strong generalization across domains, and high inference efficiency.
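The adaptive mechanism described above, combining a VLM intrinsic reward with a semantic prior under explicit uncertainty, could be sketched as follows. This is a minimal illustration under stated assumptions: the function name `fused_reward` and the inverse-variance weighting rule are ours, not the paper's implementation.

```python
def fused_reward(vlm_reward: float, semantic_prior: float, variance: float) -> float:
    """Blend a VLM intrinsic reward with a semantic prior.

    The weight on the prior grows with the epistemic variance of the
    VLM score, so noisy VLM judgments lean more on the semantic prior.
    (Illustrative weighting rule, not the paper's exact formulation.)
    """
    w = 1.0 / (1.0 + variance)  # trust the VLM more when its variance is low
    return w * vlm_reward + (1.0 - w) * semantic_prior
```

With zero variance the VLM score is used as-is; as variance grows, the fused score moves toward the semantic prior, which is one simple way to make segment evaluation "stable and precise" as the summary claims.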

๐Ÿ“ Abstract
Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves an exploration-exploitation balance through semantic guidance within a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage of unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty, achieving stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
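The anchor-collection step in the abstract (selecting frames closely matched to the current semantic queries) amounts to a nearest-neighbor search in an embedding space. A minimal sketch, assuming frames and queries are already embedded as vectors; the function names `cosine` and `pick_anchors` are illustrative, not from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pick_anchors(frame_feats, query_feat, k=4):
    """Indices of the k frames most similar to the semantic query.

    These frames would serve as semantic anchors that bias which
    tree branches get expanded first. (Illustrative sketch.)
    """
    sims = [cosine(f, query_feat) for f in frame_feats]
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
```

Since the paper updates queries online, `pick_anchors` would be re-run whenever the query set changes, refreshing the anchors that guide subsequent expansion.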
Problem

Research questions and friction points this paper is trying to address.

Balances exploration and exploitation in long video analysis
Reduces computational overhead in processing extensive visual data
Improves information coverage and efficiency in video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical tree search balances exploration and exploitation
Dynamic semantic queries guide frame selection as anchors
Uncertainty modeling combines vision-language rewards with priors
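The exploration-exploitation balance in the tree search can be illustrated with a UCB-style node selection rule over video segments: well-scoring segments are revisited (exploitation) while rarely visited segments receive an exploration bonus. This is a generic sketch of the technique named above, not EEA's actual selection rule; the `SegmentNode` fields and the constant `c` are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class SegmentNode:
    start: float        # segment start time (seconds)
    end: float          # segment end time (seconds)
    visits: int = 0     # how many times this node was expanded
    value: float = 0.0  # accumulated reward for this segment

def ucb_score(node: SegmentNode, total_visits: int, c: float = 1.4) -> float:
    """Mean reward (exploitation) plus a bonus for rarely visited
    segments (exploration); unvisited nodes get top priority."""
    if node.visits == 0:
        return float("inf")
    mean = node.value / node.visits
    bonus = c * math.sqrt(math.log(total_visits) / node.visits)
    return mean + bonus

def select(children: list[SegmentNode]) -> SegmentNode:
    """Pick the child segment with the highest UCB score."""
    total = sum(n.visits for n in children) or 1
    return max(children, key=lambda n: ucb_score(n, total))
```

In EEA, the per-node value would come from the fused semantic/VLM reward rather than a raw environment reward, so semantically promising branches are expanded preferentially while unvisited segments still get covered.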
Te Yang
Institute of Automation, Chinese Academy of Sciences
Multimodal Large Language Models
Xiangyu Zhu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Bo Wang
Kuaishou Technology, Beijing, China
Quan Chen
Kuaishou Technology, Beijing, China
Peng Jiang
Kuaishou Technology, Beijing, China
Zhen Lei