Collaborative Multi-Modal Coding for High-Quality 3D Generation

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D-native generative models are largely constrained to single-modality inputs or purely geometric representations, failing to exploit the complementary texture and geometry information in multimodal data such as RGB images, RGB-D, and point clouds, while also suffering from limited available training data. To address these limitations, the authors propose TriMM, the first feed-forward 3D-native generative model that learns jointly from these basic modalities. TriMM introduces a collaborative multi-modal coding mechanism that fuses heterogeneous representations while preserving each modality's strengths, reinforces that coding with auxiliary 2D and 3D supervision, and decodes the resulting latent through a triplane latent diffusion model. Despite training on a comparatively small dataset, TriMM generates high-fidelity 3D assets with improved texture sharpness and geometric detail, demonstrating that synergistic multi-modal modeling offers a scalable paradigm beyond monomodal or geometry-only methods.
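The paper text does not include code, so the following is a minimal sketch of what the collaborative multi-modal coding stage could look like under common design choices: one encoder per modality produces tokens, and learnable latent queries cross-attend over all tokens to form a shared code. The class name, encoder placeholders, and dimensions are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of collaborative multi-modal coding (not the authors' code).
# Assumption: each modality (RGB, RGB-D, point cloud) has its own encoder, and
# attention-based fusion pools the per-modality tokens into one shared latent.
import torch
import torch.nn as nn

class CollaborativeMultiModalCoder(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_latent=256):
        super().__init__()
        # Modality-specific encoders (placeholders; real ones would likely be a
        # ViT for RGB/RGB-D patches and a point transformer for point clouds).
        self.rgb_enc = nn.Linear(3 * 16 * 16, dim)   # flattened 16x16 RGB patches
        self.rgbd_enc = nn.Linear(4 * 16 * 16, dim)  # flattened 16x16 RGB-D patches
        self.pcd_enc = nn.Linear(3, dim)             # xyz coordinates per point
        # Learnable queries pool all modality tokens into a fixed-size code.
        self.latent_queries = nn.Parameter(torch.randn(n_latent, dim))
        self.fuse = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, rgb_patches, rgbd_patches, points):
        # Encode each modality separately, preserving its own representation.
        tokens = torch.cat([
            self.rgb_enc(rgb_patches),    # (B, N_rgb, dim)
            self.rgbd_enc(rgbd_patches),  # (B, N_rgbd, dim)
            self.pcd_enc(points),         # (B, N_pts, dim)
        ], dim=1)
        q = self.latent_queries.expand(tokens.size(0), -1, -1)
        # Cross-attention fusion: latent queries attend over all modality tokens.
        code, _ = self.fuse(q, tokens, tokens)
        return code                       # (B, n_latent, dim) shared code
```

A query-based pooling like this keeps the output size fixed regardless of how many modalities or tokens are available, which is one plausible way to make the coder tolerate heterogeneous inputs.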


📝 Abstract
3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGB-D, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modal data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGB-D, and point clouds). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of the multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both their texture and geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets, despite using a small amount of training data. Furthermore, additional experiments on recent RGB-D datasets verify the feasibility of incorporating other multi-modal datasets into 3D generation.
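The abstract's triplane latent diffusion stage is likewise not specified in code. Below is a minimal sketch, assuming a standard DDPM-style epsilon-prediction denoiser over the three axis-aligned feature planes, conditioned on the multi-modal code pooled to a single vector per asset. The class, schedule, and conditioning scheme are assumptions for illustration.

```python
# Hypothetical sketch of triplane latent diffusion (not the authors' code).
# Assumption: triplanes are stacked along channels as (B, 3*C, H, W), and the
# multi-modal code conditions the denoiser via a FiLM-style additive bias.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDenoiser(nn.Module):
    """Predicts the noise added to triplane latents (epsilon-prediction)."""
    def __init__(self, channels=32, code_dim=512, hidden=128):
        super().__init__()
        self.inp = nn.Conv2d(3 * channels, hidden, 3, padding=1)
        self.cond = nn.Linear(code_dim + 1, hidden)  # code + timestep scalar
        self.mid = nn.Sequential(
            nn.SiLU(), nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.SiLU(), nn.Conv2d(hidden, 3 * channels, 3, padding=1),
        )

    def forward(self, noisy_planes, t, code):
        # noisy_planes: (B, 3*C, H, W); t: (B,); code: (B, code_dim) pooled.
        h = self.inp(noisy_planes)
        c = self.cond(torch.cat([code, t.float().unsqueeze(1)], dim=1))
        h = h + c[:, :, None, None]  # broadcast conditioning over the planes
        return self.mid(h)

def diffusion_training_step(denoiser, planes, code, T=1000):
    """One DDPM training step on clean triplane latents `planes`."""
    B = planes.size(0)
    t = torch.randint(0, T, (B,))
    betas = torch.linspace(1e-4, 0.02, T)                    # linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1, 1)
    noise = torch.randn_like(planes)
    noisy = alpha_bar.sqrt() * planes + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t, code)
    return F.mse_loss(pred, noise)  # epsilon-prediction objective
```

The triplane layout keeps the diffusion backbone fully 2D-convolutional while still representing a 3D field, which is a common reason this representation is chosen for 3D-native generation.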
Problem

Research questions and friction points this paper is trying to address.

Integrating multi-modal data for 3D generation
Overcoming single-modality limitations in 3D modeling
Enhancing texture and geometric detail quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative multi-modal coding for 3D generation
Auxiliary 2D and 3D supervision for robustness (see the loss sketch after this list)
Triplane latent diffusion model for superior quality
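For the second innovation above, the paper does not spell out the loss terms; a minimal sketch under assumed choices follows: a 2D term comparing renderings of the generated asset against ground-truth images, and a 3D term comparing sampled surface points against a ground-truth point cloud via chamfer distance. The function names and weights are hypothetical.

```python
# Hypothetical sketch of combined 2D/3D auxiliary supervision (not the authors'
# exact losses). Assumption: L1 on renderings for appearance, chamfer distance
# on point samples for geometry, mixed with scalar weights.
import torch
import torch.nn.functional as F

def chamfer(a, b):
    # a: (B, N, 3), b: (B, M, 3); symmetric nearest-neighbour distance.
    d = torch.cdist(a, b)                       # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def auxiliary_supervision(rendered, gt_images, pred_points, gt_points,
                          w2d=1.0, w3d=0.5):
    loss_2d = F.l1_loss(rendered, gt_images)    # texture / appearance term
    loss_3d = chamfer(pred_points, gt_points)   # geometry term
    return w2d * loss_2d + w3d * loss_3d
```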