🤖 AI Summary
This work addresses cross-view texture inconsistency and low inference efficiency in text-driven 3D texture generation. We propose the first end-to-end diffusion method that performs multi-view noise-prediction fusion directly in UV parameterization space. Our core innovation integrates multi-view rendering projections into the diffusion process: at each denoising step, noise predictions from all views are jointly optimized in UV space under cross-view consistency constraints that enforce geometric and textural alignment, eliminating the need for conventional iterative optimization or sequential rendering. Built on a fine-tuned Stable Diffusion model, our method supports high-quality, seam-free texture synthesis on meshes of arbitrary topology. Experiments demonstrate a 3.2× inference speedup over state-of-the-art methods, along with significant improvements in FID and LPIPS. A user study further confirms superior visual consistency.
📝 Abstract
We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency. In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.
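The fusion step described above, in which per-view noise predictions are merged in UV space at each denoising step so that all views denoise in a 3D-consistent direction, can be sketched as follows. This is an illustrative NumPy sketch under assumed data layouts (precomputed per-view pixel-to-texel index maps and visibility masks from the rendering projections), not the paper's implementation; the function and parameter names are hypothetical.

```python
import numpy as np

def fuse_noise_in_uv(noise_views, uv_index_views, visibility_views, uv_size):
    """Fuse per-view noise predictions into one shared UV-space noise texture.

    noise_views:      list of (H, W, C) per-view noise predictions
    uv_index_views:   list of (H, W) int maps, view pixel -> flat UV texel index
    visibility_views: list of (H, W) bool masks of pixels that see the surface
    uv_size:          (Hu, Wu) resolution of the UV texture

    Returns the fused UV noise (Hu, Wu, C) and the re-projected per-view
    noise maps, which now agree wherever views overlap on the surface.
    """
    Hu, Wu = uv_size
    C = noise_views[0].shape[-1]
    accum = np.zeros((Hu * Wu, C))
    count = np.zeros(Hu * Wu)

    # Scatter-accumulate each view's visible predictions into UV space.
    for noise, idx, vis in zip(noise_views, uv_index_views, visibility_views):
        texels = idx[vis]
        np.add.at(accum, texels, noise[vis])
        np.add.at(count, texels, 1.0)

    # Average texels covered by multiple views; untouched texels stay zero.
    fused = accum / np.maximum(count, 1.0)[:, None]

    # Re-project: each view reads the shared UV noise where the surface is
    # visible and keeps its own prediction on background pixels.
    out_views = []
    for noise, idx, vis in zip(noise_views, uv_index_views, visibility_views):
        out = noise.copy()
        out[vis] = fused[idx[vis]]
        out_views.append(out)

    return fused.reshape(Hu, Wu, C), out_views
```

In a full denoising loop, this fusion would run once per diffusion step on the UNet's noise outputs before the scheduler update, which is what removes the need for sequential view-by-view synthesis.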