🤖 AI Summary
Current AI systems struggle to integrate visual perception, causal reasoning, and sequential decision-making in the physical world, particularly in tasks requiring creative manipulation. This work proposes the first interactive synthetic environment centered on origami as a unified benchmark: models perform iterative folding actions, guided by feedback on physical validity and similarity to a target shape, in a task that tightly couples visual understanding, symbolic reasoning, and modeling of geometric and physical constraints. The framework comprises an interactive simulation platform, a vision–language evaluation protocol, and a physics-based verification mechanism. Experimental results demonstrate that merely scaling up model size fails to substantially improve multi-step causal reasoning, revealing fundamental limitations in how existing architectures integrate vision–language representations and generate coherent, goal-directed strategies.
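To make the interaction concrete, here is a minimal Python sketch of the propose-fold / feedback loop described above. All names (`DummyOrigamiEnv`, `FoldAction`, `propose_fold`, `run_episode`) are illustrative assumptions, not the benchmark's actual API, and the dummy environment stands in for the real physics-based verifier with random feedback.

```python
import random
from dataclasses import dataclass


@dataclass
class FoldAction:
    """A single proposed fold: a crease line given by two 2D points, plus a type."""
    p1: tuple  # first endpoint of the crease line (x, y)
    p2: tuple  # second endpoint of the crease line (x, y)
    kind: str  # "valley" or "mountain"


class DummyOrigamiEnv:
    """Toy stand-in for the simulation platform: it verifies nothing and
    simply returns random validity / target-similarity feedback."""

    def reset(self, target_id):
        self.similarity = 0.0
        return {"image": None, "target": target_id}  # observation

    def step(self, action):
        is_valid = random.random() > 0.2  # placeholder for the physics check
        if is_valid:
            self.similarity = min(1.0, self.similarity + 0.2 * random.random())
        return {"image": None}, is_valid, self.similarity


def run_episode(env, propose_fold, target_id, max_folds=10):
    """The evaluation loop the summary describes: propose a fold, receive
    validity and target-similarity feedback, and repeat."""
    obs = env.reset(target_id)
    history = []
    for _ in range(max_folds):
        action = propose_fold(obs, history)  # a VLM call in the real benchmark
        obs, is_valid, similarity = env.step(action)
        history.append((action, is_valid, similarity))
    return max(s for _, _, s in history)


def random_agent(obs, history):
    """Trivial baseline: propose a uniformly random crease."""
    return FoldAction(
        p1=(random.random(), random.random()),
        p2=(random.random(), random.random()),
        kind=random.choice(["valley", "mountain"]),
    )


print(run_episode(DummyOrigamiEnv(), random_agent, target_id="crane"))
```

In the real setting, `propose_fold` would query a vision–language model on the rendered state, and `step` would run the simulator's physical-validity check rather than a coin flip.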
📝 Abstract
Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal world model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that their visual and language representations remain weakly integrated.
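The abstract does not specify how "similarity to a target configuration" is scored. A common choice for this kind of shape-matching feedback is intersection-over-union (IoU) between binary silhouettes of the rendered folded state and the target; the sketch below assumes that metric purely for illustration and is not the paper's actual scoring function.

```python
import numpy as np


def silhouette_iou(rendered: np.ndarray, target: np.ndarray) -> float:
    """IoU between two binary (H, W) masks: folded-paper silhouette vs. target."""
    rendered = rendered.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(rendered, target).sum()
    union = np.logical_or(rendered, target).sum()
    return float(intersection) / float(union) if union else 1.0


# Toy check: two overlapping 32x32 squares on a 64x64 canvas.
a = np.zeros((64, 64)); a[8:40, 8:40] = 1
b = np.zeros((64, 64)); b[16:48, 16:48] = 1
print(f"IoU = {silhouette_iou(a, b):.3f}")  # ~0.391
```

Any perceptual or geometric distance could stand in for IoU here; what matters for the benchmark is that the signal is returned after every proposed fold, giving the model iterative feedback toward the target.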