🤖 AI Summary
Existing high-fidelity 3D generation methods predominantly rely on two-stage diffusion pipelines, which suffer from geometry-appearance misalignment and high computational overhead. To address this, we propose the first single-stage, end-to-end 3D generation framework. Our approach introduces a geometry-appearance unified variational autoencoder (UniVAE) that compresses sparse, high-resolution features into a compact latent space, UniLat, which jointly encodes geometric and appearance information. Coupled with a flow-matching model, our method maps Gaussian noise directly to UniLat, enabling sub-second generation of 3D Gaussians or meshes from a single input image. Trained exclusively on public datasets, our framework achieves state-of-the-art geometric accuracy, appearance fidelity, and inference efficiency, and eliminates the structure-texture mismatch inherent in sequential two-stage modeling.
📝 Abstract
High-fidelity 3D asset generation is crucial for many industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built on diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and incurs non-negligible computational cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation -- UniLat. UniLat integrates structural and visual information into a dense low-resolution latent that can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly to UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets from a single image in seconds, achieving superior appearance fidelity and geometric quality. More demos and code are available at https://unilat3d.github.io/
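The single-stage pipeline described above can be illustrated with a minimal sketch: flow matching learns a velocity field that transports Gaussian noise to the target latent distribution, and sampling integrates that field as an ODE. The shapes, function names, and toy velocity field below are illustrative assumptions, not the paper's actual model or API; a trained UniLat3D network would replace `velocity_field`, and a decoder would then turn the sampled UniLat into 3D Gaussians or a mesh.

```python
import numpy as np

# Assumed shape for the dense low-resolution UniLat latent:
# (X, Y, Z, channels). Purely illustrative.
LATENT_SHAPE = (16, 16, 16, 8)

def velocity_field(x, t, cond):
    """Stand-in for the learned flow-matching network v_theta(x, t | image).
    A trained model predicts the velocity transporting noise toward the
    UniLat distribution; here we use a toy linear pull toward `cond`."""
    return cond - x  # toy dynamics only

def sample_unilat(cond, steps=20, seed=0):
    """Single-stage sampling: integrate dx/dt = v(x, t) from t=0
    (Gaussian noise) to t=1 (UniLat) with explicit Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(LATENT_SHAPE)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, cond)
    return x

# Usage: `cond` stands in for an image-conditioning signal; decoding the
# sampled latent into Gaussians/meshes would follow in the real pipeline.
cond = np.zeros(LATENT_SHAPE)
lat = sample_unilat(cond)
print(lat.shape)
```

With the toy field pulling toward `cond = 0`, each Euler step shrinks the state by a factor `(1 - dt)`, so the sample contracts toward the condition, which mirrors how the learned field moves noise toward the data latent; the real model would be conditioned on image features rather than a fixed tensor.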