🤖 AI Summary
This work proposes a method for reconstructing an editable 3D scene from a single image without specialized 2D/3D foundation models, differentiable rendering, or multi-view supervision. It introduces an agentic framework built on general-purpose vision-language models that decomposes the inverse graphics task into staged refinement of geometry, materials, composition, and lighting, and directly generates executable Blender scripts. The resulting scenes can be rendered, relit, and edited, and they are produced using only off-the-shelf vision-language models. Staged reconstruction substantially improves fidelity at the pixel, perceptual, and semantic levels across diverse scenes, and the reconstructed scenes support a range of downstream editing and rendering applications.
📝 Abstract
Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct an image as an editable 3D scene that can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors, including geometry, materials, composition, and lighting, directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.