AI Summary
Existing evaluation methods for web generation are confined to pixel- or DOM-level comparisons, which makes them inadequate for assessing the hidden runtime state and interaction logic of 3D scenes built with frameworks such as Three.js. This work introduces WorldCoder-Bench, a benchmark for physically grounded 3D interactive world generation from natural language, together with StateProbe, a state-aware evaluation protocol. StateProbe runs each generated program in a sandboxed browser and verifies hidden, mutation-hardened contracts over its runtime states and transitions. The benchmark covers Simulation, Rendering, and Application tasks with optional .glb assets, and reports verification coverage, Return on Automation, and Time Efficiency Multiplier. On the WorldCoder-Core and WorldCoder-Robust subsets, the best-performing model reaches only 27.8% and 19.9% verification coverage, respectively. Failures stem mainly from state-schema drift and broken interaction chains, though cheap or fast models still provide practical value on easier domains.
Abstract
Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a Three.js world unfold inside an opaque <canvas>. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at https://anonymous.4open.science/r/WorldCoder-Bench/.
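To make the idea of verifying contracts over runtime states and transitions concrete, the following is a minimal sketch, not the StateProbe implementation. It assumes a probe has already recorded a trace of state snapshots (one dictionary per step) from the generated program. The names `Contract`, `check_contract`, and `coverage`, the state keys `y` and `paused`, and the reading of coverage as the fraction of contracts that pass are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Contract:
    """Hypothetical behavioral contract over a recorded state trace."""
    name: str
    required_keys: List[str]  # keys every snapshot is expected to expose
    transition_ok: Callable[[State, State], bool]  # predicate on (before, after)


def check_contract(contract: Contract, trace: List[State]) -> str:
    """Return 'pass', 'schema_drift', or 'violation' for one contract."""
    # A missing key means the program exposed state under a different shape.
    for snapshot in trace:
        if any(key not in snapshot for key in contract.required_keys):
            return "schema_drift"
    # Every pair of consecutive snapshots must satisfy the transition predicate.
    for before, after in zip(trace, trace[1:]):
        if not contract.transition_ok(before, after):
            return "violation"
    return "pass"


def coverage(contracts: List[Contract], trace: List[State]) -> float:
    """Fraction of contracts that pass (an assumed definition of coverage)."""
    if not contracts:
        return 0.0
    passed = sum(check_contract(c, trace) == "pass" for c in contracts)
    return passed / len(contracts)


# Two illustrative contracts for a falling ball with a pause control.
ball_falls = Contract("ball_falls", ["y"], lambda b, a: a["y"] <= b["y"])
pause_freezes = Contract(
    "pause_freezes",
    ["y", "paused"],
    lambda b, a: (not b["paused"]) or a["y"] == b["y"],
)

good_trace = [
    {"y": 5.0, "paused": False},
    {"y": 4.0, "paused": True},
    {"y": 4.0, "paused": True},
]
# Same behavior, but the height is exposed under a different key.
drifted_trace = [{"height": s["y"], "paused": s["paused"]} for s in good_trace]

good_coverage = coverage([ball_falls, pause_freezes], good_trace)
drifted_coverage = coverage([ball_falls, pause_freezes], drifted_trace)
```

In this sketch the drifted trace fails both contracts even though the scene behaves identically, which is one plausible way a state-schema drift failure could be distinguished from a genuine behavioral violation.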