🤖 AI Summary
This work addresses texture artifacts and blurred object boundaries in text-guided 3D Gaussian Splatting editing, which stem from multi-view geometric inconsistency and under-utilized depth information. To tackle these issues, we propose two core innovations: (1) a complementary information mutual learning network that significantly improves cross-view depth estimation accuracy; and (2) a wavelet consensus attention mechanism that aligns latent-space representations across views during diffusion denoising. Our method tightly integrates 3D Gaussian Splatting, depth-conditioned editing, and implicit modeling via diffusion priors, preserving geometric fidelity while enhancing texture plausibility and boundary sharpness. Extensive experiments demonstrate state-of-the-art performance on key metrics, including rendering quality, multi-view consistency, and editing fidelity, comprehensively outperforming existing methods.
📝 Abstract
We present a novel framework for enhancing the visual fidelity and consistency of text-guided 3D Gaussian Splatting (3DGS) editing. Existing editing approaches face two critical challenges: inconsistent geometric reconstruction across viewpoints, particularly at challenging camera positions, and ineffective use of depth information during image manipulation, which leads to over-texturing artifacts and degraded object boundaries. To address these limitations, we introduce: 1) a complementary information mutual learning network that refines depth maps estimated from 3DGS, enabling precise depth-conditioned 3D editing while preserving geometric structure; and 2) a wavelet consensus attention mechanism that aligns latent codes during the diffusion denoising process, ensuring multi-view consistency in the edited results. Extensive experiments demonstrate superior rendering quality and view consistency compared to state-of-the-art approaches, validating our framework as an effective solution for text-guided editing of 3D scenes.
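The abstract does not spell out how the wavelet consensus attention mechanism operates, so the sketch below is only a rough illustration of the general idea, not the paper's actual implementation: each view's latent map is decomposed with a one-level Haar wavelet transform, and the low-frequency (LL) sub-bands are replaced by an attention-weighted consensus over all views before reconstruction, leaving high-frequency detail per-view. All function names, shapes, and the single-channel latent assumption are hypothetical.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform of an (H, W) array with even H, W.
    Returns the (LL, LH, HL, HH) sub-bands, each (H//2, W//2)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row averages
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row details
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Exact inverse of haar_dwt2."""
    a = np.empty((LL.shape[0], LL.shape[1] * 2))
    a[:, 0::2] = LL + LH
    a[:, 1::2] = LL - LH
    d = np.empty_like(a)
    d[:, 0::2] = HL + HH
    d[:, 1::2] = HL - HH
    x = np.empty((a.shape[0] * 2, a.shape[1]))
    x[0::2, :] = a + d
    x[1::2, :] = a - d
    return x

def wavelet_consensus(latents, tau=1.0):
    """Toy cross-view consensus: latents is a (V, H, W) stack of per-view
    latent maps. Each view's LL band is replaced by a softmax-attention
    average of all views' LL bands; detail bands are kept per-view."""
    bands = [haar_dwt2(z) for z in latents]
    LLs = np.stack([b[0] for b in bands])              # (V, h, w)
    V = LLs.shape[0]
    flat = LLs.reshape(V, -1)
    # Attention logits from low-frequency similarity between views.
    logits = flat @ flat.T / (tau * np.sqrt(flat.shape[1]))
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # (V, V) row-stochastic
    consensus = (w @ flat).reshape(LLs.shape)
    return np.stack([haar_idwt2(consensus[v], *bands[v][1:]) for v in range(V)])
```

Operating on the LL band only is one plausible reading of "wavelet consensus": it enforces agreement on coarse structure across views while leaving each view's fine texture untouched, which matches the stated goal of multi-view consistency without washing out detail.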