BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Multimodal large language models handle static image understanding well but remain weak at intuitive physical reasoning from a single image, such as predicting how objects will move and interact. This work introduces a billiards benchmark built on a procedural physics engine and evaluates models on three tasks: ball-to-ball collision prediction, wall rebound reasoning, and final-position estimation. Experiments reveal a consistent "stasis bias": when the correct outcome is harder to infer, models tend to predict no interaction. Performance also degrades markedly as geometric complexity and simulation duration increase. The study provides a structured evaluation framework and empirical evidence to guide the incorporation of stronger physical inductive biases into multimodal models.
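One way to quantify the tendency to predict no interaction is to measure how often a model answers "no interaction" on scenes where an interaction actually occurs. The sketch below is an illustration only: the function name `stasis_rate` and the boolean label format are assumptions, not the metric defined by the benchmark.

```python
def stasis_rate(truth, predicted):
    """Fraction of true-interaction scenes where the model predicted none.

    truth, predicted: equal-length sequences of booleans, where True means
    "an interaction (collision or wall bounce) occurs". Hypothetical format.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    # Keep only scenes where an interaction really happens.
    preds_on_positives = [p for t, p in zip(truth, predicted) if t]
    if not preds_on_positives:
        return 0.0
    missed = sum(1 for p in preds_on_positives if not p)
    return missed / len(preds_on_positives)


if __name__ == "__main__":
    truth = [True, True, False, True]
    predicted = [False, True, False, False]
    print(stasis_rate(truth, predicted))
```

Under this definition the value is simply the false-negative rate on interaction labels; comparing it across short and long simulations would show whether the tendency grows with difficulty.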
📝 Abstract
Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call "stasis bias": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.
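To make the three tasks concrete, here is a minimal sketch of how ground-truth labels could be produced by a toy billiards simulation with friction and elastic collisions. It is not the benchmark's engine: the table size, ball radius, constant-deceleration friction model, equal ball masses, time step, and all names (`Ball`, `simulate`) are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Ball:
    x: float
    y: float
    vx: float
    vy: float


def simulate(balls, width=2.0, height=1.0, radius=0.03,
             decel=0.5, dt=0.001, max_time=30.0):
    """Step equal-mass balls until all stop (or max_time is reached).

    Returns (ball_hits, wall_hits, final_positions). All parameters are
    illustrative assumptions. The input balls are modified in place.
    """
    ball_hits = 0
    wall_hits = 0
    t = 0.0
    while t < max_time and any(b.vx or b.vy for b in balls):
        for b in balls:
            speed = math.hypot(b.vx, b.vy)
            if speed == 0.0:
                continue
            # Friction: reduce speed by a constant amount per step.
            scale = max(0.0, speed - decel * dt) / speed
            b.vx *= scale
            b.vy *= scale
            b.x += b.vx * dt
            b.y += b.vy * dt
            # Elastic wall bounce: flip the velocity component that
            # points into the wall.
            if (b.x < radius and b.vx < 0) or (b.x > width - radius and b.vx > 0):
                b.vx = -b.vx
                wall_hits += 1
            if (b.y < radius and b.vy < 0) or (b.y > height - radius and b.vy > 0):
                b.vy = -b.vy
                wall_hits += 1
        # Elastic ball-to-ball collisions between equal masses: exchange
        # the velocity components along the line joining the centers.
        for i in range(len(balls)):
            for j in range(i + 1, len(balls)):
                a, c = balls[i], balls[j]
                dx, dy = c.x - a.x, c.y - a.y
                dist = math.hypot(dx, dy)
                if 0.0 < dist < 2 * radius:
                    nx, ny = dx / dist, dy / dist
                    approach = (a.vx - c.vx) * nx + (a.vy - c.vy) * ny
                    if approach > 0:  # only resolve if moving together
                        a.vx -= approach * nx
                        a.vy -= approach * ny
                        c.vx += approach * nx
                        c.vy += approach * ny
                        ball_hits += 1
        t += dt
    return ball_hits, wall_hits, [(b.x, b.y) for b in balls]


if __name__ == "__main__":
    cue = Ball(0.5, 0.5, 1.0, 0.0)      # moving right
    target = Ball(1.0, 0.5, 0.0, 0.0)   # at rest, directly in the path
    print(simulate([cue, target]))
```

The three return values map onto the three tasks: whether `ball_hits` is nonzero (collision prediction), the value of `wall_hits` (wall-bounce reasoning), and `final_positions` (final-state estimation). Raising the initial speed lengthens the simulation and adds bounces, which corresponds to the regime where the abstract reports that model performance drops.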
Problem

Research questions and friction points this paper is trying to address.

physical reasoning
visual dynamics
multimodal LLMs
intuitive physics
object interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

physical reasoning
multimodal LLMs
visual dynamics
stasis bias
procedural benchmark