HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation

📅 2024-07-19

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing CLIP-guided text-to-3D generation methods suffer from semantic misalignment between text and image embeddings, which limits their ability to synthesize 3D shapes that are faithful to the input text. To address this, the paper proposes HOTS3D, a semantic alignment framework based on spherical optimal transport (SOT). The authors describe it as the first attempt to apply SOT to text-to-3D generation. It uses Villani’s theorem to derive a solution that aligns two high-dimensional hyperspherical distributions directly, without relying on manifold exponential maps. The optimal Kantorovich potential is parameterized with input convex neural networks (ICNNs). The aligned CLIP features are then passed to a diffusion-based generator and a NeRF-based decoder to produce 3D shapes. According to the authors, comparisons with state-of-the-art methods show better 3D shape generation, especially in consistency with text semantics.

📝 Abstract

Recent CLIP-guided 3D generation methods have achieved promising results but struggle with generating faithful 3D shapes that conform with input text due to the gap between text and image embeddings. To this end, this paper proposes HOTS3D, which makes the first attempt to effectively bridge this gap by aligning text features to the image features with spherical optimal transport (SOT). However, in high-dimensional situations, solving the SOT remains a challenge. To obtain the SOT map for high-dimensional features obtained from CLIP encoding of two modalities, we mathematically formulate and derive the solution based on Villani's theorem, which can directly align two hyper-sphere distributions without manifold exponential maps. Furthermore, we implement it by leveraging input convex neural networks (ICNNs) for the optimal Kantorovich potential. With the optimally mapped features, a diffusion-based generator and a NeRF-based decoder are subsequently utilized to transform them into 3D shapes. Extensive quantitative and qualitative comparisons with state-of-the-art methods demonstrate the superiority of the proposed HOTS3D for 3D shape generation, especially on the consistency with text semantics.
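
To make the "hyper-sphere distributions" in the abstract concrete, the sketch below projects feature vectors onto the unit hypersphere and measures the geodesic (great-circle) distance between them. It is only an illustration: the random arrays stand in for CLIP text and image features, the dimension 512 is an assumption, and the helper names `to_hypersphere` and `geodesic_distance` are hypothetical. The abstract does not state which transport cost HOTS3D uses, so no cost function is implied here.

```python
import numpy as np

def to_hypersphere(features, eps=1e-12):
    """L2-normalize each row so that it lies on the unit hypersphere."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)

def geodesic_distance(x, y):
    """Great-circle distance between matching rows of unit vectors x and y."""
    # Clip guards against dot products slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(x * y, axis=-1), -1.0, 1.0)
    return np.arccos(cos)

# Stand-in data (assumption): random vectors in place of real CLIP features.
rng = np.random.default_rng(0)
text_feats = to_hypersphere(rng.normal(size=(4, 512)))
image_feats = to_hypersphere(rng.normal(size=(4, 512)))

# Per-pair angular gap between "text" and "image" features, in radians.
gap = geodesic_distance(text_feats, image_feats)
print(gap.shape, float(gap.mean()))
```

An alignment map in the spirit of the abstract would aim to move the text features so that their distribution on the sphere matches that of the image features.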

Problem

Research questions and friction points this paper is trying to address.

Bridging text-image embedding gap in 3D generation

Solving high-dimensional spherical optimal transport challenges

Enhancing text-to-3D semantic alignment accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses spherical optimal transport for alignment

Derives the SOT solution mathematically via Villani's theorem

Parameterizes the optimal Kantorovich potential with ICNNs
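
As a rough illustration of the ICNN idea named above, the following NumPy sketch builds a small network whose output is convex in its input. It is not the authors' architecture: the layer sizes, the softplus activation, the random weights, and the class name `TinyICNN` are all assumptions made for the example. Convexity comes from two standard ingredients: non-negative weights on the hidden-to-hidden path, and an activation that is convex and non-decreasing.

```python
import numpy as np

def softplus(z):
    # Convex, non-decreasing activation; logaddexp keeps it numerically stable.
    return np.logaddexp(0.0, z)

class TinyICNN:
    """Illustrative input convex network f: R^dim -> R (hypothetical sizes)."""

    def __init__(self, dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        # Direct input connections may have any sign (they are affine in x).
        self.A0 = rng.normal(size=(hidden, dim))
        self.A1 = rng.normal(size=(hidden, dim))
        self.A2 = rng.normal(size=dim)
        # Hidden-to-hidden weights are kept non-negative to preserve convexity.
        self.W1 = np.abs(rng.normal(size=(hidden, hidden)))
        self.W2 = np.abs(rng.normal(size=hidden))
        self.b0 = rng.normal(size=hidden)
        self.b1 = rng.normal(size=hidden)
        self.b2 = float(rng.normal())

    def __call__(self, x):
        z = softplus(self.A0 @ x + self.b0)
        z = softplus(self.W1 @ z + self.A1 @ x + self.b1)
        return float(self.W2 @ z + self.A2 @ x + self.b2)

f = TinyICNN(dim=8)

# Numerical check of midpoint convexity: f((x + y) / 2) <= (f(x) + f(y)) / 2.
rng = np.random.default_rng(1)
pairs = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(100)]
is_convex = all(f((x + y) / 2) <= (f(x) + f(y)) / 2 + 1e-9 for x, y in pairs)
print(is_convex)
```

In a trained model the non-negativity constraint would have to be maintained during optimization, for example by clamping or reparameterizing the constrained weights; how HOTS3D handles this is not described in the abstract.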
