🤖 AI Summary
This study investigates the fundamental limitations of large vision-language models (VLMs) in mechanical reasoning, specifically across system stability, gear and pulley systems, levers, inertial motion, and fluid dynamics, and identifies pathways for improvement. Method: Leveraging 155 cognitive psychology experiments, we systematically evaluate 26 state-of-the-art VLMs and introduce MechBench, the first comprehensive, multi-physics-domain benchmark for mechanical reasoning. Contribution/Results: All models exhibit substantial performance gaps relative to human baselines, especially in gear engagement modeling and dynamic fluid understanding. Performance on these tasks does not improve with model size, suggesting that current attention-based architectures may be intrinsically limited in reasoning that relies on mental simulation. Our work establishes the first controllable, cross-physics-domain evaluation framework; through cross-modal behavioral analysis, it pinpoints systematic reasoning failures; and it provides a novel assessment paradigm and concrete directions for advancing embodied AI and physical commonsense modeling.
📝 Abstract
Mechanical reasoning is a hallmark of human intelligence, defined by its ubiquitous yet irreplaceable role in human activities ranging from routine tasks to civil engineering. Endowing machines with mechanical reasoning is therefore an important step towards building human-level artificial intelligence. Here, we leveraged 155 cognitive experiments to test the understanding of system stability, gear and pulley systems, the principle of leverage, inertia and motion, and fluid mechanics in 26 Vision Language Models (VLMs). Results indicate that VLMs consistently perform worse than humans in all domains and demonstrate significant difficulty in reasoning about gear systems and fluid mechanics. Notably, their performance on these tasks does not improve as the number of parameters increases, suggesting that current attention-based architectures may fail to grasp certain underlying mechanisms required for mechanical reasoning, particularly those pertaining to mental simulation.