🤖 AI Summary
This work addresses the lack of realistic, fine-grained benchmarks for evaluating how well current foundation models understand low-level Newtonian physics. To bridge the gap between synthetic data and real-world complexity, we introduce NewtPhys, a 4D dataset that combines multi-view images of real scenes with physics-grounded simulations, pairing visual realism with dense spatiotemporal physical annotations, including 3D forces and amodal per-pixel quantities. Building on this dataset, we establish an evaluation framework for vision-language models (VLMs) and vision foundation models (VFMs) and systematically assess 56 VLMs and 10 VFMs. Our analysis reveals limitations in existing models' ability to reason about low-level physical dynamics, and the dataset and benchmark give the community a basis for future research in physics-grounded vision.
📝 Abstract
Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps (including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics, and geometry), bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs (54 open-weight models and 2 closed-source frontier models) and 10 VFMs, and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.