🤖 AI Summary
This study investigates whether pretrained video foundation models encode intuitive-physics knowledge in their frozen representations, and how that knowledge varies across architectures, network depths, and probe types. Using frozen-feature probing and frame-order perturbation experiments on the IntPhys2 and Minimal Video Pairs (MVP) benchmarks, it compares three pretraining paradigms: joint-embedding prediction (V-JEPA), masked reconstruction (VideoMAE), and diffusion-based generation (LTX-Video). V-JEPA achieves the strongest overall results, particularly with probes that model temporal dynamics; VideoMAE remains competitive, and LTX-Video yields a weaker but non-trivial signal. Physics-relevant information is weakest in early layers and most accessible at intermediate-to-late depth, and disrupting frame order substantially reduces performance, especially on MVP. Overall, the results suggest that intuitive-physics knowledge is present in pretrained video representations, but that its accessibility depends strongly on the pretraining objective, representational depth, and readout mechanism.
📝 Abstract
We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
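To make the probing setup concrete, the following is a minimal sketch of frozen-feature probing with a frame-order control. It is not the authors' pipeline: the per-frame features are synthetic stand-ins for encoder outputs (no V-JEPA, VideoMAE, or LTX-Video weights are loaded), the probe is a plain logistic regression on flattened per-frame features, and every function name and hyperparameter here is a hypothetical choice made for illustration. The synthetic label is deliberately encoded only as a temporal trend, so a probe that depends on frame order should degrade once the frames are shuffled.

```python
import numpy as np

def make_synthetic_features(n_clips, n_frames, dim, rng):
    """Stand-in for frozen per-frame encoder features (synthetic, assumed data)."""
    labels = rng.integers(0, 2, size=n_clips)
    feats = rng.normal(0.0, 1.0, size=(n_clips, n_frames, dim))
    # Encode the label only as a temporal trend in feature 0:
    # it rises over time for class 1 and falls for class 0.
    ramp = np.linspace(-1.0, 1.0, n_frames)
    sign = 2 * labels - 1
    feats[:, :, 0] += 2.0 * sign[:, None] * ramp[None, :]
    return feats, labels

def flatten(feats):
    """Concatenate per-frame features so the probe can use frame order."""
    return feats.reshape(len(feats), -1)

def fit_linear_probe(x, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic-regression probe trained by gradient descent.
    Only the probe weights are learned; the features stay fixed."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b

def accuracy(w, b, x, y):
    preds = (x @ w + b > 0).astype(int)
    return float((preds == y).mean())

def shuffle_frames(feats, rng):
    """Temporal control: permute frame order independently within each clip."""
    out = np.empty_like(feats)
    for i in range(len(feats)):
        out[i] = feats[i, rng.permutation(feats.shape[1])]
    return out

rng = np.random.default_rng(0)
train_x, train_y = make_synthetic_features(400, 8, 16, rng)
test_x, test_y = make_synthetic_features(200, 8, 16, rng)

w, b = fit_linear_probe(flatten(train_x), train_y)
acc_ordered = accuracy(w, b, flatten(test_x), test_y)
acc_shuffled = accuracy(w, b, flatten(shuffle_frames(test_x, rng)), test_y)
print(f"ordered frames:  {acc_ordered:.3f}")
print(f"shuffled frames: {acc_shuffled:.3f}")
```

In a real experiment, `make_synthetic_features` would be replaced by activations extracted from a chosen layer of a frozen encoder, and repeating the fit per layer would give a layerwise accessibility curve. The gap between the ordered and shuffled accuracies is the quantity a frame-order control of this kind measures.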