HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

📅 2024-11-27
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video understanding methods struggle to model complex, higher-order multi-object interactions and causal reasoning. To address this, we propose the first hypergraph-based multimodal large language model (MLLM) framework that unifies spatial entity structures with causal temporal dynamics, representing video scenes as higher-order relational hypergraphs that can be injected into LLMs. Our method integrates hypergraph neural networks, MLLM collaborative reasoning, cross-view representation learning, and a scene-graph structural injection mechanism. We further introduce the first large-scale, multi-view Video Scene Graph Reasoning (VSGR) benchmark supporting five tasks, including scene graph generation and future relation prediction. Extensive experiments demonstrate state-of-the-art performance across all five tasks, with significant improvements in modeling intricate object interactions and forecasting future relational states.

๐Ÿ“ Abstract
Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views, and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across all five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
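The abstract describes a unified HyperGraph that combines per-frame entity scene graphs (pairwise spatial relations) with a procedural graph whose hyperedges span groups of entities to model causal transitions, which is then linearized and injected into an LLM. The sketch below illustrates that data structure only; all class and method names are hypothetical and the serialization format is an assumption, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneHyperGraph:
    # entity id -> label (e.g. "person", "door"); illustrative, not the paper's schema
    nodes: dict = field(default_factory=dict)
    # pairwise spatial triplets per frame: (subject, predicate, object, frame)
    entity_edges: list = field(default_factory=list)
    # hyperedges: each links an arbitrary set of entities to one causal transition
    hyperedges: list = field(default_factory=list)

    def add_entity(self, eid, label):
        self.nodes[eid] = label

    def add_relation(self, subj, pred, obj, frame):
        self.entity_edges.append((subj, pred, obj, frame))

    def add_transition(self, entity_ids, description):
        # a hyperedge over >2 entities captures a multi-way interaction that
        # pairwise scene-graph edges cannot express
        self.hyperedges.append((frozenset(entity_ids), description))

    def to_prompt(self):
        # linearize the graph into text so it can be placed in an LLM context
        rels = [f"{self.nodes[s]} {p} {self.nodes[o]} @f{f}"
                for s, p, o, f in self.entity_edges]
        trans = [f"{{{', '.join(sorted(self.nodes[e] for e in ids))}}}: {d}"
                 for ids, d in self.hyperedges]
        return "relations: " + "; ".join(rels) + "\ntransitions: " + "; ".join(trans)

# usage: two pairwise relations plus one three-way causal transition
g = SceneHyperGraph()
g.add_entity(0, "person"); g.add_entity(1, "door"); g.add_entity(2, "key")
g.add_relation(0, "holds", 2, frame=3)
g.add_relation(0, "next_to", 1, frame=3)
g.add_transition([0, 1, 2], "person unlocks door with key")
print(g.to_prompt())
```

The key design point is that hyperedges connect entity *sets*, not pairs, so a single edge can carry a higher-order event such as "person unlocks door with key" that would otherwise be split across several disconnected pairwise relations.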
Problem

Research questions and friction points this paper is trying to address.

Enhancing video scene understanding with HyperGraph for complex interactions
Integrating entity and procedural graphs for unified relationship modeling
Addressing limitations in multi-object reasoning across diverse video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates entity and procedural graphs into HyperGraph
Injects HyperGraph into LLMs for enhanced reasoning
Introduces VSGR dataset with 1.9M diverse frames