HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation

📅 2024-07-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CLIP-guided text-to-3D generation methods suffer from a gap between text and image embeddings, which limits their ability to synthesize 3D shapes that are faithful to the input text. HOTS3D addresses this with a semantic alignment framework based on spherical optimal transport (SOT), which the authors describe as the first attempt to bridge this gap with SOT in text-to-3D generation. The SOT map for high-dimensional CLIP features is derived from Villani's theorem, so two hyperspherical distributions are aligned directly, without manifold exponential maps. The optimal Kantorovich potential is parameterized with input convex neural networks (ICNNs). The aligned features are then passed to a diffusion-based generator and a NeRF-based decoder; comparisons with state-of-the-art methods show improved 3D shape generation, especially in consistency with text semantics.

📝 Abstract
Recent CLIP-guided 3D generation methods have achieved promising results but struggle to generate faithful 3D shapes that conform to the input text, due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D, which makes the first attempt to effectively bridge this gap by aligning text features to image features with spherical optimal transport (SOT). However, solving the SOT problem in high-dimensional settings remains a challenge. To obtain the SOT map for the high-dimensional features produced by CLIP encoding of the two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator and a NeRF-based decoder are subsequently used to transform them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of the proposed HOTS3D for 3D shape generation, especially in consistency with text semantics.
Problem

Research questions and friction points this paper is trying to address.

Bridging text-image embedding gap in 3D generation
Solving high-dimensional spherical optimal transport challenges
Enhancing text-to-3D semantic alignment accuracy
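
To make the "embedding gap on a hypersphere" concrete, here is a minimal sketch in Python. It assumes random vectors stand in for CLIP text and image features (no CLIP model is loaded), that the features are L2-normalized onto the unit hypersphere, and that the gap is measured by geodesic (great-circle) distance. The function names, the dimension 512, and the choice of distance are illustrative assumptions; the summary above does not state the paper's exact transport cost.

```python
import numpy as np

def to_unit_sphere(features, eps=1e-12):
    """Project row vectors onto the unit hypersphere by L2 normalization."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)

def geodesic_distance(x, y):
    """Great-circle distance between paired unit vectors (rows of x and y)."""
    cos = np.clip(np.sum(x * y, axis=-1), -1.0, 1.0)  # clip guards against round-off
    return np.arccos(cos)

# Hypothetical stand-ins for CLIP text and image features (assumed dimension 512).
rng = np.random.default_rng(0)
text_feats = to_unit_sphere(rng.normal(size=(4, 512)))
image_feats = to_unit_sphere(rng.normal(size=(4, 512)))

# Per-pair angular gap between the two modalities, in radians.
gap = geodesic_distance(text_feats, image_feats)
```

An alignment map in this setting would move each text feature along the sphere so that the mapped text distribution matches the image distribution; the sketch only measures the gap and does not perform any transport.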
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses spherical optimal transport for alignment
Derives solution via Villani's theorem mathematically
Implements ICNNs for optimal Kantorovich potential
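
As an illustration of the last point, the following is a minimal sketch of an input convex neural network in NumPy. It follows the general ICNN recipe (non-negative weights on the hidden-to-hidden path, a convex non-decreasing activation, and unconstrained skip connections from the input), not the architecture used in HOTS3D, which this summary does not specify. The class name `TinyICNN`, the layer sizes, and the random initialization are assumptions made for the example, and no training is shown.

```python
import numpy as np

def softplus(v):
    """Convex, non-decreasing activation: log(1 + exp(v)), computed stably."""
    return np.logaddexp(0.0, v)

class TinyICNN:
    """Scalar-valued network that is convex in its input x (illustrative only)."""

    def __init__(self, dim, hidden=16, layers=2, seed=0):
        rng = np.random.default_rng(seed)
        # Skip connections from the input may take any sign.
        self.Wx = [rng.normal(scale=0.5, size=(hidden, dim)) for _ in range(layers)]
        self.b = [np.zeros(hidden) for _ in range(layers)]
        # Hidden-to-hidden and output weights are kept non-negative,
        # which is what preserves convexity through the layers.
        self.Wz = [np.abs(rng.normal(scale=0.5, size=(hidden, hidden)))
                   for _ in range(layers - 1)]
        self.w_out = np.abs(rng.normal(scale=0.5, size=hidden))

    def __call__(self, x):
        z = softplus(self.Wx[0] @ x + self.b[0])
        for Wz, Wx, b in zip(self.Wz, self.Wx[1:], self.b[1:]):
            z = softplus(Wz @ z + Wx @ x + b)
        return float(self.w_out @ z)

potential = TinyICNN(dim=8)
value = potential(np.ones(8))
```

In optimal transport methods that use ICNNs, a network of this kind plays the role of the convex potential and the transport map is obtained from its gradient; how HOTS3D adapts this to the hypersphere is described in the paper itself rather than in this summary.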
Zezeng Li
Dalian University of Technology
Computer Vision, Generative Model
Weimin Wang
International Information and Software Institute, Dalian University of Technology, Dalian, 116620, China
WenHai Li
School of Software, Dalian University of Technology, Dalian, 116620, China
Na Lei
International Information and Software Institute, Dalian University of Technology, Dalian, 116620, China
Xianfeng Gu
Department of Computer Science and Applied Mathematics, State University of New York at Stony Brook, Stony Brook, NY 11794-2424, USA