🤖 AI Summary
Existing vision-language models struggle to comprehend dynamic 4D scenes, primarily because they observe motion only through 2D projections and because existing datasets conflate camera and object motion. To address this, this work proposes True-Motion Tracking, a mechanism that decouples camera and object motion by tracking within a fixed reference frame. By integrating multi-view video, 3D reconstruction, and object tracking, the authors build a scalable pipeline for generating 4D-aware question-answer pairs. From this pipeline they derive 4DP-QA, a large-scale QA dataset for 4D scene understanding with 400,000 training samples, along with 4DP-QA-Bench, a benchmark of 2,200 evaluation samples. Training existing models on 4DP-QA improves their performance on an external benchmark, supporting the value of explicit 4D motion modeling for vision-language reasoning.
📝 Abstract
Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that reasoning about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We pay particular attention to the entanglement of camera and object motion by casting tracking both in the traditional way and in a novel fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.
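The difference between tracking in the camera frame and tracking in a fixed reference system can be illustrated with a toy example. The sketch below is not the authors' implementation; it assumes that a camera-to-world pose (rotation `R`, translation `t`) is known for every frame, and the helper names `camera_to_world` and `displacement` are hypothetical. It shows a static object that appears to move in camera coordinates only because the camera moves.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]

# Assumption: poses are camera-to-world, so p_world = R @ p_cam + t.
IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


def camera_to_world(p_cam: Vec3, R: Mat3, t: Vec3) -> Vec3:
    """Map a point from camera coordinates into the fixed world frame."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


def displacement(track: Sequence[Vec3]) -> Vec3:
    """Net displacement between the first and last point of a track."""
    return tuple(track[-1][i] - track[0][i] for i in range(3))


# Toy scene: the camera slides 1 unit along +x per frame without rotating,
# while the object stays put at a fixed world position.
camera_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
object_world: Vec3 = (2.0, 0.0, 5.0)

# What a camera-frame tracker would observe (with R = I, p_cam = p_world - t).
cam_track = [
    tuple(object_world[i] - t[i] for i in range(3)) for t in camera_positions
]

# Re-express every observation in the fixed world frame.
world_track = [
    camera_to_world(p, IDENTITY, t) for p, t in zip(cam_track, camera_positions)
]

apparent_motion = displacement(cam_track)    # motion entangled with the camera
true_motion = displacement(world_track)      # motion in the fixed frame

print("camera-frame displacement:", apparent_motion)
print("fixed-frame displacement: ", true_motion)
```

In this toy scene the camera-frame track drifts along the x axis while the fixed-frame track does not move at all, which is the kind of ambiguity a question such as "is the object moving?" would otherwise inherit from the camera.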