Flow and Depth Assisted Video Prediction with Latent Transformer

📅 2025-11-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key challenges in video prediction: occlusion handling and background motion modeling. Methodologically, it proposes a latent-variable Transformer framework integrated with geometric priors—specifically, point flow and depth maps—to explicitly encode motion structure and scene geometry as conditional inputs; additionally, it introduces a Wasserstein distance loss grounded in object masks to enforce consistency between predicted and ground-truth motion distributions. The core contribution lies in the principled fusion of dense geometric cues with probabilistic latent modeling, substantially improving joint representation of dynamic occlusions, complex motions, and background variations. Extensive experiments on both synthetic and real-world benchmarks demonstrate that the method achieves superior motion prediction accuracy and background consistency under occlusion compared to state-of-the-art approaches.

Technology Category

Application Category

📝 Abstract
Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and the background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
Problem

Research questions and friction points this paper is trying to address.

Video prediction struggles with occlusion and background motion
Explicit motion and geometric information improves occluded prediction
Systematic study of depth and flow assisted video prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates point-flow for motion information
Integrates depth-maps for geometric structure
Uses latent transformer for occluded video prediction
🔎 Similar Papers
No similar papers found.
E
Eliyas Suleyman
School of Coumputing Science, University of Glasglow, Glasgow, Scotland, UK, G12 8QQ
Paul Henderson
Paul Henderson
University of Glasgow
computer visionmachine learning
E
Eksan Firkat
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Nicolas Pugeault
Nicolas Pugeault
Reader, School of Computing Science, University of Glasgow
Computer VisionMachine LearningCognitive Robotics