Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a method for reconstructing an editable 3D scene from a single image without specialized 2D/3D foundation models, differentiable rendering, or multi-view supervision. It introduces an agentic framework built on general-purpose vision-language models that decomposes inverse graphics into staged refinement of geometry, materials, composition, and lighting, and directly generates executable Blender scripts. The resulting scenes can be rendered, relit, and manipulated, and they are produced using only off-the-shelf vision-language models. According to the authors' experiments, the staged approach substantially improves fidelity at the pixel, perceptual, and semantic levels across diverse scenes, and the reconstructed scenes support a range of downstream editing and rendering applications.

📝 Abstract

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
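
To make the staged idea concrete, the following is a minimal sketch of a refinement loop over the four scene factors named in the abstract. It is not the authors' implementation: the greedy accept-if-better rule, the fixed number of rounds per stage, and the `propose` and `score` callables are assumptions introduced for illustration. In a real system, `propose` would presumably query a VLM for a revised Blender script and `score` would render that script in Blender and compare the result with the input image; here both are toy stand-ins so the loop can run on its own.

```python
from typing import Callable, Sequence

# Scene factors listed in the abstract, refined one stage at a time.
STAGES = ("geometry", "materials", "composition", "lighting")


def staged_reconstruct(
    propose: Callable[[str, str], str],  # hypothetical: (stage, script) -> candidate script
    score: Callable[[str], float],       # hypothetical: script -> similarity, higher is better
    stages: Sequence[str] = STAGES,
    rounds_per_stage: int = 3,           # assumed budget, not taken from the paper
    initial_script: str = "",
) -> str:
    """Greedily refine a script stage by stage, keeping only improvements."""
    best, best_score = initial_script, score(initial_script)
    for stage in stages:
        for _ in range(rounds_per_stage):
            candidate = propose(stage, best)
            candidate_score = score(candidate)
            # Assumed acceptance rule: keep a candidate only if it scores higher.
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best


# Toy stand-ins: each stage appends a marker line, and the score counts markers.
def toy_propose(stage: str, script: str) -> str:
    marker = f"# {stage}\n"
    return script if marker in script else script + marker


def toy_score(script: str) -> float:
    return float(sum(f"# {s}\n" in script for s in STAGES))


result = staged_reconstruct(toy_propose, toy_score)
print(result)
```

With the toy stand-ins, the returned script contains one marker line per stage in the order the stages were visited, which mirrors the progressive, factor-by-factor refinement the abstract describes.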

Problem

Research questions and friction points this paper is trying to address.

inverse graphics

3D scene reconstruction

executable representation

single-image reconstruction

Blender

Innovation

Methods, ideas, or system contributions that make the work stand out.

executable inverse graphics

vision-language models

Blender program synthesis