🤖 AI Summary
To address weak controllability over appearance style and multi-view inconsistency in text- or image-guided 3D generation, this paper proposes a zero-shot, single-step appearance stylization method for 3D models. Without fine-tuning or test-time optimization, it injects CLIP-extracted style features from a reference image into selected attention blocks of a pre-trained large-scale 3D reconstruction model (e.g., MVDiffusion), explicitly exploiting the appearance representations these blocks implicitly encode. The key contribution is the discovery and exploitation of the implicit global-appearance modeling capability embedded in the attention mechanisms of 3D reconstruction models, combined with spatial alignment and CLIP-based semantic guidance to preserve multi-view consistency. Experiments show that the method achieves state-of-the-art visual quality, multi-view consistency, and inference speed, significantly outperforming existing style transfer and 3D editing approaches.
📝 Abstract
With the growing success of text- or image-guided 3D generators, users demand more control over the generation process, with appearance stylization being one such control. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency across multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe whether large reconstruction models, commonly used in the context of 3D generation, have a similar capability. We discover that certain attention blocks in these models capture appearance-specific features. By injecting features from a visual style image into such blocks, we develop a simple yet effective 3D appearance stylization method that requires neither training nor test-time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results for 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes.
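To make the core idea concrete, the kind of attention-level style injection described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes token features have already been extracted from the content branch and the style reference, and uses a made-up blending knob `alpha` to bias attention toward style tokens. The key/value tokens of an attention block are augmented with tokens from the style image, so the output mixes in the style's appearance statistics while the content queries keep the original spatial layout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stylized_attention(q, k_content, v_content, k_style, v_style, alpha=0.5):
    """Hypothetical sketch of style injection in one attention block.

    q:         (N, D) queries from the content branch
    k_content: (M, D) keys from the content branch
    v_content: (M, D) values from the content branch
    k_style:   (S, D) keys extracted from the style reference image
    v_style:   (S, D) values extracted from the style reference image
    alpha:     in (0, 1]; larger values attend more to style tokens
    """
    # Concatenate content and style tokens into one key/value bank.
    k = np.concatenate([k_content, k_style], axis=0)
    v = np.concatenate([v_content, v_style], axis=0)

    # Standard scaled dot-product attention logits.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)

    # Bias the logits so a fraction ~alpha of attention mass shifts to
    # the style tokens (an illustrative blending choice, not the paper's).
    eps = 1e-8
    bias = np.concatenate([
        np.full(k_content.shape[0], np.log(1.0 - alpha + eps)),
        np.full(k_style.shape[0], np.log(alpha + eps)),
    ])
    return softmax(logits + bias) @ v

# Tiny usage example with random features.
rng = np.random.default_rng(0)
q  = rng.normal(size=(6, 8))
kc = rng.normal(size=(6, 8)); vc = rng.normal(size=(6, 8))
ks = rng.normal(size=(4, 8)); vs = rng.normal(size=(4, 8))
out = stylized_attention(q, kc, vc, ks, vs, alpha=0.8)
```

With `alpha` near 1 the block attends almost entirely over the style tokens; with `alpha` near 0 it reduces to ordinary self-attention on the content features, which is why such injection needs no retraining of the underlying model.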