🤖 AI Summary
Existing 3D scene generation and editing methods are typically specialized for a single task and lack a unified framework for text-driven 3D Gaussian Splatting (3DGS) generation and editing; diverse scene scales and complex camera trajectories in real-world settings pose further challenges. To address this, the paper introduces SplatFlow, a unified text-driven framework for 3DGS generation and editing. It has two main components: a multi-view rectified flow (RF) model that operates in latent space and jointly generates multi-view images, depth maps, and camera poses conditioned on text; and a feed-forward Gaussian Splatting decoder (GSDecoder) that translates these latent outputs into 3DGS representations. Training-free inversion and inpainting techniques then enable 3DGS editing without additional complex pipelines. The approach is validated on MVImgNet and DL3DV-7K across 3D generation, object editing, novel view synthesis, and camera pose estimation.
📝 Abstract
Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks, including object editing, novel view synthesis, and camera pose estimation, within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
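As a rough illustration of how rectified-flow sampling can be combined with training-free, mask-based inpainting, the sketch below integrates a flow ODE with Euler steps and re-imposes the known latent outside the edit mask at every step. It is not SplatFlow's implementation: `toy_velocity` is a hypothetical closed-form stand-in for the learned multi-view RF network, latents are plain lists of floats, the time convention (t = 0 is noise, t = 1 is data) is an assumption, and the per-step blending is a commonly used inpainting scheme chosen here for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned, text-conditioned RF network.

    For a straight path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to `target` at t = 1 is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_with_mask(known, mask, target, steps=10, seed=0):
    """Euler-integrate the flow from noise (t = 0) to data (t = 1).

    mask[i] == 1 marks latent entries to regenerate; mask[i] == 0 marks
    entries that must keep the value in `known` (assumed convention).
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in known]
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        # Outside the mask, overwrite x with the known latent brought to
        # noise level t, so the kept region stays on its own straight path.
        x = [
            xi if m else (1.0 - t) * n + t * kn
            for xi, m, n, kn in zip(x, mask, noise, known)
        ]
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    # At t = 1, paste the clean known values back outside the mask.
    return [xi if m else kn for xi, m, kn in zip(x, mask, known)]


if __name__ == "__main__":
    known = [1.0, 2.0, 3.0, 4.0]    # latent of the existing scene (toy)
    mask = [0, 1, 1, 0]             # edit only the two middle entries
    target = [9.0, 9.0, 9.0, 9.0]   # what the toy "prompt" pulls toward
    print(sample_with_mask(known, mask, target))
```

In a real pipeline the velocity would come from the trained network conditioned on the text prompt, and the resulting latents would then be passed to a decoder such as the GSDecoder described above; the toy field only makes the control flow checkable.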