Towards Scalable and Consistent 3D Editing

📅 2025-10-03

🤖 AI Summary

3D editing faces three core challenges: cross-view inconsistency, geometric distortion, and reliance on manually annotated 3D masks. To address these, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, and propose 3DEditFormer, a 3D-structure-preserving conditional transformer. The dataset is built through two complementary pipelines, pose-driven geometric edits and foundation-model-guided appearance edits, which together target edit locality, multi-view consistency, and semantic alignment. The model augments an image-to-3D generation framework with dual-guidance attention and time-adaptive gating to disentangle editable regions from preserved structure, so precise 3D masks are no longer required. Experiments show that the framework outperforms state-of-the-art baselines both quantitatively and qualitatively.

📝 Abstract

3D editing, the task of locally modifying the geometry or appearance of a 3D asset, has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on accurate, manually annotated 3D masks, which are error-prone and impractical to produce. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation-model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://www.lv-lab.org/3DEditFormer/

Problem

Research questions and friction points this paper is trying to address.

Achieving cross-view consistency in 3D editing without geometric distortions

Eliminating dependency on manual 3D masks for precise local modifications

Enabling scalable and practical 3D editing for immersive content creation

Innovation

Methods, ideas, or system contributions that make the work stand out.

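3DEditVerse, the largest paired 3D editing benchmark to date, with 116,309 training pairs and 1,500 curated test pairs

3DEditFormer, a 3D-structure-preserving conditional transformer that disentangles editable regions from preserved structure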

Dual-guidance attention and time-adaptive gating enable mask-free editing
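
The abstract names dual-guidance attention and time-adaptive gating but does not spell out their form. The following is a minimal NumPy sketch of one plausible reading, not the authors' implementation: tokens attend separately to a structure-preserving feature set and an edit-guidance feature set, and a scalar gate that depends on the diffusion timestep blends the two results. The function names, the sigmoid gate, its `sharpness` and `midpoint` parameters, and the choice to favor structure guidance at noisier timesteps are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Learned query/key/value projections are omitted to keep the sketch short.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def time_gate(t, sharpness=10.0, midpoint=0.5):
    """Hypothetical gate in (0, 1) for a normalized timestep t in [0, 1].

    Assumption: t = 1 is the noisiest step, and the gate grows with t.
    """
    return 1.0 / (1.0 + np.exp(-sharpness * (t - midpoint)))


def dual_guided_update(tokens, structure_feats, edit_feats, t):
    """Blend two guidance branches with a timestep-dependent gate.

    Assumed behavior: structure guidance dominates at noisy steps (g near 1),
    edit guidance dominates at late, low-noise steps (g near 0).
    """
    g = time_gate(t)
    structure_out = cross_attention(tokens, structure_feats)
    edit_out = cross_attention(tokens, edit_feats)
    return tokens + g * structure_out + (1.0 - g) * edit_out


# Usage with random stand-in features (shapes are illustrative only).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))           # 4 latent tokens, 8 channels
structure_feats = rng.normal(size=(6, 8))  # features from the source asset
edit_feats = rng.normal(size=(5, 8))       # features describing the edit
updated = dual_guided_update(tokens, structure_feats, edit_feats, t=0.8)
print(updated.shape)  # (4, 8)
```

Because the gate is a function of the timestep rather than a fixed spatial mask, a blend of this kind needs no user-supplied 3D mask, which matches the mask-free claim above. Which branch should dominate at which timestep is a design choice the abstract does not specify.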
