🤖 AI Summary
Existing agent-based approaches to long-form video understanding struggle to perceive fine-grained motion accurately and to model long-range dependencies efficiently. This work introduces, for the first time, the Group of Pictures (GOP) structure from video codecs into a video understanding framework. It proposes a motion-aware agent architecture that combines GOP-based tree reasoning, a structured memory mechanism, and a coarse-to-fine zoom-in strategy, which allows efficient modeling of local motion details and fast retrieval of motion vectors at multiple granularities. The proposed method achieves significant performance gains on video question-answering benchmarks such as MotionBench and EgoSchema, supporting the effectiveness of leveraging GOP structures for video understanding.
📝 Abstract
Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension and an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that is the first to integrate video codecs into the video understanding framework, via a motion agent trained on the Groups of Pictures (GOPs) produced by a video codec. We further develop a GOP tree reasoning algorithm that aligns naturally with the codec's structure and improves the model's understanding of fine-grained local motion in videos. Additionally, we design a structural memory mechanism that stores local motion information together with detailed captions in structured pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit this memory. We also incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and EgoSchema.
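The abstract does not specify how the GOP tree or the zoom-in procedure is implemented. As a rough illustration only, the sketch below assumes that a GOP begins at each I-frame, that adjacent GOPs are merged bottom-up into a tree with a fixed fan-out, and that zoom-in greedily follows the best-scoring child. The names `split_into_gops`, `build_tree`, `zoom_in`, the `Node` class, and the scoring function are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def split_into_gops(frame_types: str) -> List[range]:
    """Split a frame-type string such as "IPPPIPP" into GOPs.

    Assumption: a new GOP starts at every I-frame. Frames before the
    first I-frame (if any) form their own leading group.
    """
    if not frame_types:
        return []
    starts = [i for i, t in enumerate(frame_types) if t == "I"]
    if not starts or starts[0] != 0:
        starts = [0] + starts
    ends = starts[1:] + [len(frame_types)]
    return [range(s, e) for s, e in zip(starts, ends)]


@dataclass
class Node:
    start: int  # first frame index covered (inclusive)
    end: int    # end of the covered span (exclusive)
    children: List["Node"] = field(default_factory=list)


def build_tree(gops: List[range], fanout: int = 2) -> Node:
    """Merge adjacent GOPs bottom-up until a single root remains."""
    if not gops:
        raise ValueError("need at least one GOP")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [Node(g.start, g.stop) for g in gops]
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [Node(g[0].start, g[-1].end, g) for g in groups]
    return level[0]


def zoom_in(root: Node, score: Callable[[Node], float]) -> List[Node]:
    """Greedy coarse-to-fine descent: at each level keep the best child."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=score)
        path.append(node)
    return path


if __name__ == "__main__":
    gops = split_into_gops("IPPPIPPIPPPPIPP")
    root = build_tree(gops)
    # Hypothetical relevance score: does the span contain frame 9?
    path = zoom_in(root, lambda n: 1.0 if n.start <= 9 < n.end else 0.0)
    for n in path:
        print(f"frames [{n.start}, {n.end})")
```

In a real system the scoring function would presumably be produced by the agent (for example, from captions or motion features stored for each node) rather than by a frame-index test; the test is used here only to keep the sketch self-contained.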