OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis

📅 2026-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI systems struggle to integrate visual perception, causal reasoning, and sequential decision-making in the physical world, particularly in tasks requiring creative manipulation. This work proposes the first interactive synthetic environment centered on origami as a unified benchmark, tightly coupling visual understanding, symbolic reasoning, and modeling of geometric and physical constraints through iterative folding actions guided by feedback on physical validity and target similarity. The framework introduces an interactive simulation platform, a vision–language evaluation protocol, and a physics-based verification mechanism. Experimental results demonstrate that merely scaling up model size fails to substantially improve multi-step causal reasoning, revealing fundamental limitations in existing architectures regarding the integration of vision–language representations and the generation of coherent, goal-directed strategies.

📝 Abstract
Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.
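The interaction protocol the abstract describes (a model iteratively proposes folds and receives feedback on physical validity and similarity to the target) can be sketched as a minimal loop. This is a hypothetical illustration only: the environment class, method names, action format, and scoring below are assumptions, not OrigamiBench's actual API.

```python
from dataclasses import dataclass

@dataclass
class FoldFeedback:
    valid: bool        # did the fold satisfy the physical constraints?
    similarity: float  # 0..1 similarity to the target configuration

class ToyOrigamiEnv:
    """Hypothetical stand-in for the OrigamiBench environment.

    A real environment would simulate crease geometry and verify
    flat-foldability; here we only mimic the feedback interface
    that the benchmark's interaction loop relies on."""

    def __init__(self, target_steps: int):
        self.target_steps = target_steps
        self.steps_done = 0

    def step(self, fold_action: dict) -> FoldFeedback:
        # A real check would test geometric/physical validity of the crease;
        # this toy version only bounds the fold angle.
        valid = 0.0 <= fold_action["angle"] <= 180.0
        if valid:
            self.steps_done += 1
        similarity = min(self.steps_done / self.target_steps, 1.0)
        return FoldFeedback(valid=valid, similarity=similarity)

# An agent iteratively proposes folds and observes the feedback.
env = ToyOrigamiEnv(target_steps=3)
feedback = None
for crease in range(3):
    feedback = env.step({"line": crease, "angle": 180.0})
print(feedback.valid, feedback.similarity)
```

The key design point is that feedback is interactive and per-fold, so an agent must integrate each validity/similarity signal into its next decision rather than emit a full folding plan up front.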
Problem

Research questions and friction points this paper is trying to address.

causal reasoning
physical constraints
visual perception
sequential planning
integrated reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

OrigamiBench
flat-foldable origami
causal reasoning
multimodal integration
interactive benchmark
Authors

Naaisha Agarwal (Algoverse AI Research)
Yihan Wu (State Key Lab of CAD&CG, Zhejiang University)
Yichang Jian (State Key Lab of CAD&CG, Zhejiang University)
Yifei Peng (State Key Lab of CAD&CG, Zhejiang University)
Yao-Xiang Ding (Assistant Professor, Zhejiang University; machine learning)
Nishad Mansoor (Computer Science, Northeastern University)
Yikuan Hu (National Key Laboratory for Novel Software Technology, Nanjing University)
Mohan Li (National Key Laboratory for Novel Software Technology, Nanjing University)
Wang-Zhou Dai (National Key Laboratory for Novel Software Technology, Nanjing University)
Emanuele Sansone (KU Leuven/MIT)