HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

📅 2024-11-27
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video understanding methods struggle to model complex, higher-order multi-object interactions and causal reasoning. To address this, we propose the first hypergraph-based multimodal large language model (MLLM) framework that unifies spatial entity structures with causal temporal dynamics, representing video scenes as higher-order relational hypergraphs that can be injected into LLMs. Our method integrates hypergraph neural networks, MLLM collaborative reasoning, cross-view representation learning, and a scene-graph structural injection mechanism. We further introduce the first large-scale, multi-view Video Scene Graph Reasoning (VSGR) benchmark supporting five tasks, including scene graph generation and future relation prediction. Extensive experiments demonstrate state-of-the-art performance across all five tasks, with significant improvements in modeling intricate object interactions and forecasting future relational states.

๐Ÿ“ Abstract
Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views, and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across all five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
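The abstract describes a unified HyperGraph that combines per-frame entity scene graphs (pairwise spatial relations) with a procedural graph whose hyperedges span groups of entities to model causal transitions, which is then linearized and injected into an LLM. The sketch below illustrates that data structure only; all class and method names are hypothetical and the serialization format is an assumption, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneHyperGraph:
    # entity id -> label (e.g. "person", "door"); illustrative, not the paper's schema
    nodes: dict = field(default_factory=dict)
    # pairwise spatial triplets per frame: (subject, predicate, object, frame)
    entity_edges: list = field(default_factory=list)
    # hyperedges: each links an arbitrary set of entities to one causal transition
    hyperedges: list = field(default_factory=list)

    def add_entity(self, eid, label):
        self.nodes[eid] = label

    def add_relation(self, subj, pred, obj, frame):
        self.entity_edges.append((subj, pred, obj, frame))

    def add_transition(self, entity_ids, description):
        # a hyperedge over >2 entities captures a multi-way interaction that
        # pairwise scene-graph edges cannot express
        self.hyperedges.append((frozenset(entity_ids), description))

    def to_prompt(self):
        # linearize the graph into text so it can be placed in an LLM context
        rels = [f"{self.nodes[s]} {p} {self.nodes[o]} @f{f}"
                for s, p, o, f in self.entity_edges]
        trans = [f"{{{', '.join(sorted(self.nodes[e] for e in ids))}}}: {d}"
                 for ids, d in self.hyperedges]
        return "relations: " + "; ".join(rels) + "\ntransitions: " + "; ".join(trans)

# usage: two pairwise relations plus one three-way causal transition
g = SceneHyperGraph()
g.add_entity(0, "person"); g.add_entity(1, "door"); g.add_entity(2, "key")
g.add_relation(0, "holds", 2, frame=3)
g.add_relation(0, "next_to", 1, frame=3)
g.add_transition([0, 1, 2], "person unlocks door with key")
print(g.to_prompt())
```

The key design point is that hyperedges connect entity *sets*, not pairs, so a single edge can carry a higher-order event such as "person unlocks door with key" that would otherwise be split across several disconnected pairwise relations.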
Problem

Research questions and friction points this paper is trying to address.

Enhancing video scene understanding with HyperGraph for complex interactions
Integrating entity and procedural graphs for unified relationship modeling
Addressing limitations in multi-object reasoning across diverse video tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates entity and procedural graphs into HyperGraph
Injects HyperGraph into LLMs for enhanced reasoning
Introduces VSGR dataset with 1.9M diverse frames