🤖 AI Summary
This work addresses unsupervised video understanding and forecasting in dynamic scenes. We propose an object-centric modeling framework grounded in frequency-domain phase correlation. By recursively analyzing inter-frame phase relationships, our method explicitly disentangles object prototypes and models their geometric transformations, enabling fully unsupervised object decomposition, motion inference, and future-frame prediction. The core innovation is the integration of frequency-domain phase correlation with lightweight learnable modules to construct interpretable, trackable object representations. Evaluated on multiple synthetic benchmarks, our approach substantially outperforms existing object-centric models in unsupervised object tracking and video prediction, while demonstrating stronger generalization and computational efficiency.
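As a rough intuition for the frequency-domain phase correlation the framework builds on (not the paper's actual implementation), the NumPy sketch below estimates the translation between two frames from the peak of the normalized cross-power spectrum; the function name, the `eps` constant, and the integer peak readout are illustrative assumptions.

```python
import numpy as np

def phase_correlation(frame_a, frame_b, eps=1e-8):
    """Estimate the (dy, dx) translation that maps frame_a onto frame_b
    from the peak of the normalized cross-power spectrum."""
    Fa = np.fft.fft2(frame_a)
    Fb = np.fft.fft2(frame_b)
    # Normalizing out the magnitude keeps only the phase difference,
    # which encodes the relative translation between the two frames.
    cross_power = Fb * np.conj(Fa)
    cross_power /= np.abs(cross_power) + eps
    correlation = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Displacements past half the frame size wrap around to negative shifts.
    h, w = frame_a.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

# Example: a frame shifted by (3, 5) pixels is recovered from the peak.
frame_a = np.random.rand(64, 64)
frame_b = np.roll(frame_a, shift=(3, 5), axis=(0, 1))
print(phase_correlation(frame_a, frame_b))  # -> (3, 5)
```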
📝 Abstract
Understanding and predicting video content is essential for planning and reasoning in dynamic environments. Despite recent advances, unsupervised learning of object representations and dynamics remains challenging. We present VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction. Our model uses frequency-domain phase correlation techniques to recursively parse videos into object components, which are represented as transformed versions of learned object prototypes, enabling accurate and interpretable tracking. By explicitly modeling object motion through a combination of frequency-domain operations and lightweight learned modules, VideoPCDNet achieves accurate unsupervised object tracking and prediction of future video frames. In our experiments, we demonstrate that VideoPCDNet outperforms multiple object-centric baseline models for unsupervised tracking and prediction on several synthetic datasets, while learning interpretable object and motion representations.
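The abstract does not spell out how prototypes are transformed; as a hedged illustration of one frequency-domain mechanism consistent with the description, the sketch below translates an object prototype by an arbitrary (possibly sub-pixel) displacement via the Fourier shift theorem, the kind of operation that would let an extrapolated motion estimate place an object in a predicted future frame. The function name and the constant-velocity extrapolation are assumptions, not details from the paper.

```python
import numpy as np

def shift_prototype(prototype, dy, dx):
    """Translate a 2-D object prototype by (dy, dx) pixels (sub-pixel allowed)
    by applying a phase ramp in the frequency domain (Fourier shift theorem)."""
    h, w = prototype.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies (cycles per pixel)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies
    phase_ramp = np.exp(-2j * np.pi * (fy * dy + fx * dx))
    return np.fft.ifft2(np.fft.fft2(prototype) * phase_ramp).real

# Hypothetical constant-velocity forecast: if an object moved by (vy, vx)
# between the last two frames, render its prototype at the extrapolated
# position to form that object's component of the predicted next frame.
prototype = np.zeros((64, 64))
prototype[28:36, 28:36] = 1.0  # toy square object
vy, vx = 2.0, -1.5
predicted_component = shift_prototype(prototype, vy, vx)
```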