Visual Relationship Forecasting in Videos

📅 2021-07-02
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
This paper introduces Visual Relationship Forecasting (VRF) in videos: given a subject-object pair and H observed frames, the task is to predict their interaction relationships over the next T frames without visual evidence. To evaluate the task, the authors build two benchmarks, VRF-AG and VRF-VidOR, with spatio-temporally localized relation annotations (13 relationships across 1923 clips and 35 relationships across 13447 clips, respectively). They propose a Graph Convolutional Transformer (GCT) framework that captures object-level dependencies with a spatio-temporal Graph Convolution Network and frame-level dependencies with a Transformer. On both datasets, GCT outperforms state-of-the-art sequence-modelling baselines, which tend to confuse visually similar relations and overfit to relationships that stay static across consecutive frames.
📝 Abstract
Real-world scenarios often require the anticipation of object interactions in an unknown future, which would assist the decision-making process of both humans and agents. To meet this challenge, we present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a reasoning manner. Specifically, given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence. To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of spatio-temporally localized visual relation annotations in a video. These two datasets densely annotate 13 and 35 visual relationships in 1923 and 13447 video clips, respectively. In addition, we present a novel Graph Convolutional Transformer (GCT) framework, which captures both object-level and frame-level dependencies by a spatio-temporal Graph Convolution Network and a Transformer. Experimental results on both the VRF-AG and VRF-VidOR datasets demonstrate that GCT outperforms state-of-the-art sequence modelling methods on visual relationship forecasting.
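The abstract describes GCT as pairing a spatio-temporal Graph Convolution Network (object-level dependencies) with a Transformer (frame-level dependencies) to forecast relation labels for T future frames from H observed ones. The paper's implementation is not reproduced here; the following NumPy sketch only illustrates that two-stage shape of the computation, with all dimensions, the single-layer depths, and the toy subject-object adjacency being assumptions rather than the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

H, T = 8, 4        # observed / forecast frame counts, per the task definition
N, D = 2, 16       # nodes per frame (subject, object) and feature dim (assumed)
R = 13             # relationship classes (VRF-AG annotates 13 relations)

# --- Object-level dependencies: a toy spatio-temporal graph convolution ---
# Subject and object nodes are fully connected within each frame (assumption).
A = np.ones((N, N))
A_hat = A / A.sum(axis=1, keepdims=True)      # row-normalised adjacency
W_g = rng.normal(scale=0.1, size=(D, D))

X = rng.normal(size=(H, N, D))                # per-frame node features (assumed given)
nodes = np.maximum(A_hat @ X @ W_g, 0.0)      # one GCN layer with ReLU, per frame
frame_emb = nodes.mean(axis=1)                # pool subject/object nodes -> (H, D)

# --- Frame-level dependencies: single-head self-attention over frames ---
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
Q, K, V = frame_emb @ W_q, frame_emb @ W_k, frame_emb @ W_v
scores = Q @ K.T / np.sqrt(D)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)      # softmax over observed frames
context = attn @ V                            # (H, D)

# --- Forecast head: one relation distribution per future frame ---
W_out = rng.normal(scale=0.1, size=(D, T * R))
logits = (context.mean(axis=0) @ W_out).reshape(T, R)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)     # (T, R): T future relation forecasts
```

Note that no future-frame pixels enter the computation anywhere: the forecast is conditioned solely on the H observed frames, which is the defining constraint of the VRF task.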
Problem

Research questions and friction points this paper is trying to address.

Object interactions must be anticipated for future frames without any visual evidence
No existing benchmark provides spatio-temporally localized relation annotations for forecasting
Sequence models struggle to distinguish visually similar relationships and overfit to relationships that stay static across consecutive frames
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defines VRF, a new task: forecasting subject-object interactions over T future frames from H observed frames
Introduces VRF-AG and VRF-VidOR, two densely annotated benchmarks with spatio-temporally localized relation annotations
Proposes GCT, pairing a spatio-temporal Graph Convolution Network (object-level dependencies) with a Transformer (frame-level dependencies)
Li Mi
EPFL
geospatial multimodality · vision-language · remote sensing
Yangjun Ou
School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China
Zhenzhong Chen
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China