🤖 AI Summary
Addressing the challenge of 4D understanding in dynamic scenes, this paper introduces the Dynamic Point Map (DPM), a learnable, unified 4D point-based representation framework. DPM systematically constructs spatiotemporal reference frames and employs a minimally complete set of such frames to jointly model multiple geometric and motion estimation tasks, including motion segmentation, scene flow estimation, 3D object tracking, and 2D inter-frame correspondence. A neural network, trained end-to-end on both synthetic and real-world video data, regresses pixel-aligned point clouds across multiple reference frames, seamlessly integrating depth prediction with geometric reasoning. Extensive experiments demonstrate state-of-the-art performance on benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow estimation, and object pose tracking. DPM significantly advances joint geometric and motion modeling in dynamic scenes, establishing a scalable, task-agnostic foundation for 4D scene understanding.
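To make the "set of spatiotemporal reference frames" idea concrete, here is a minimal sketch (variable names and the threshold are illustrative, not the paper's API): assume a predictor outputs pixel-aligned point maps for a source image, keyed by (the time at which the scene geometry is taken, the camera frame in which the points are expressed). With two timestamps `t` and `s`, tasks such as scene flow and motion segmentation then reduce to simple arithmetic on these maps.

```python
import numpy as np

# P is a dict of (H, W, 3) point maps for the source image, keyed by
# (geometry_time, reference_frame). For example, P[("t", "t")] holds the
# scene as it is at time t, and P[("s", "t")] holds the same pixels moved
# to their time-s positions, both expressed in the time-t camera frame.

def scene_flow(P):
    """3D displacement of each source pixel from time t to time s.

    Both maps share the same reference frame, so subtraction is valid."""
    return P[("s", "t")] - P[("t", "t")]

def motion_mask(P, tau=0.05):
    """Label a pixel dynamic if its 3D displacement exceeds tau (assumed
    threshold in scene units)."""
    return np.linalg.norm(scene_flow(P), axis=-1) > tau
```

This is the sense in which a minimal set of reference-frame combinations suffices: once the network regresses the point maps, the downstream motion tasks require no further learning.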
📝 Abstract
DUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the sub-tasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow and object pose tracking, achieving state-of-the-art performance. Code, models and additional results are available at https://www.robots.ox.ac.uk/~vgg/research/dynamic-point-maps/.
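A point map in the DUSt3R sense can be sketched as follows (a minimal illustration, assuming a pinhole camera with known intrinsics and a camera-to-world pose; the function name and signature are ours, not the paper's): every pixel is back-projected by its depth and transformed into a shared reference frame, yielding one 3D point per pixel.

```python
import numpy as np

def point_map(depth, K, T_wc):
    """Unproject a depth map into a pixel-aligned point cloud ("point map")
    expressed in a common reference frame.

    depth : (H, W) per-pixel depth
    K     : (3, 3) pinhole intrinsics
    T_wc  : (4, 4) camera-to-world transform
    Returns an (H, W, 3) array with one 3D point per pixel.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # back-project pixels to camera rays
    pts_cam = rays * depth[..., None]        # scale rays by depth (camera frame)
    pts_h = np.concatenate([pts_cam, np.ones((H, W, 1))], axis=-1)
    return (pts_h @ T_wc.T)[..., :3]         # express in the shared frame
```

DUSt3R predicts such maps directly from image pairs without being given `K` or `T_wc`; the point of DPM is that in a dynamic scene this single map is no longer well defined, because the spatial frame and the time at which the geometry is observed must both be chosen.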