Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine

📅 2025-11-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text-to-image diffusion models excel at semantic editing but struggle with 3D-aware object manipulation: they fail to preserve physically plausible shadows and reflections, and repeated edits often degrade global scene consistency. To address this, we propose FFSE, a 3D-aware autoregressive image editing framework that models edits as learned sequences of 3D transformations (translation, scaling, and rotation) without requiring explicit 3D reconstruction. The method combines text-to-image priors with spatial transformation modeling in an autoregressive diffusion architecture. To train sequential editing, we introduce 3DObjectEditor, a hybrid dataset built from simulated multi-round editing sequences across diverse objects and scenes. Experiments show that FFSE significantly outperforms existing methods on both single-round and multi-round 3D-aware editing tasks, producing more physically plausible shadows and reflections and better global scene coherence.

📝 Abstract

Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short at 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically consistent object editing directly on real-world images. Unlike previous approaches that either operate purely in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations. This lets users perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.
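
To make the phrase "a sequence of 3D transformations" concrete, here is a minimal sketch of how translation, scaling, and rotation compose across editing rounds in a conventional 3D engine, using 4x4 homogeneous matrices. This is an illustration only and not FFSE itself: the abstract states that FFSE learns these transformations without explicit 3D reconstruction, whereas this sketch applies them geometrically to a known 3D point. All function names (translation, scaling, rotation_y, compose, apply) are hypothetical, and rotation is restricted to the vertical axis for brevity.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def identity() -> Matrix:
    """Return the 4x4 identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    """Homogeneous matrix that moves a point by (tx, ty, tz)."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m


def scaling(s: float) -> Matrix:
    """Homogeneous matrix for uniform scaling about the world origin."""
    m = identity()
    m[0][0] = m[1][1] = m[2][2] = s
    return m


def rotation_y(degrees: float) -> Matrix:
    """Homogeneous matrix for rotation about the vertical (y) axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    m = identity()
    m[0][0], m[0][2] = c, s
    m[2][0], m[2][2] = -s, c
    return m


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def compose(edits: Sequence[Matrix]) -> Matrix:
    """Fold a list of per-round edits into one matrix.

    Edits are applied in list order, so the combined matrix is
    E_n * ... * E_2 * E_1.
    """
    total = identity()
    for edit in edits:
        total = matmul(edit, total)
    return total


def apply(m: Matrix, point: Tuple[float, float, float]) -> Tuple[float, ...]:
    """Apply a homogeneous transform to a 3D point."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


# Three editing rounds: move right, enlarge, then turn by 90 degrees.
rounds = [translation(1.0, 0.0, 0.0), scaling(2.0), rotation_y(90.0)]
print(apply(compose(rounds), (0.0, 0.0, 0.0)))  # approximately (0.0, 0.0, -2.0)
```

Because matrix multiplication is not commutative, the order of rounds matters: scaling before translating gives a different result than translating before scaling. This order dependence is one reason multi-round editing is naturally treated as a sequence rather than as a set of independent edits.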

Problem

Research questions and friction points this paper is trying to address.

Enabling 3D-aware object manipulation in real-world images

Achieving physically consistent multi-round editing operations

Preserving realistic scene effects during object transformations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive framework for 3D-aware object manipulation

Learned 3D transformations preserve realistic background effects

Hybrid dataset enables multi-round editing training
