🤖 AI Summary
Existing high-fidelity 3D generation methods predominantly rely on two-stage diffusion pipelines, which suffer from geometry-appearance misalignment and high computational overhead. To address this, we propose the first single-stage, end-to-end 3D generation framework. Our approach introduces a geometry-appearance unified variational autoencoder (UniVAE) that compresses sparse, high-resolution features into a compact latent space, UniLat, which jointly encodes geometric and appearance information. Coupled with a flow-matching model, our method maps Gaussian noise directly to UniLat, enabling sub-second generation of 3D Gaussians or meshes from a single input image. Trained exclusively on public datasets, our framework achieves state-of-the-art geometric accuracy, appearance fidelity, and inference efficiency, and eliminates the structure-texture mismatch inherent in sequential two-stage modeling.
📝 Abstract
High-fidelity 3D asset generation is crucial for many industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built on diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and incurs non-negligible computational cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation -- UniLat. UniLat integrates structural and visual information into a dense low-resolution latent that can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly to UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets from a single image in seconds, achieving superior appearance fidelity and geometric quality. More demos and code are available at https://unilat3d.github.io/
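The single-stage pipeline described above can be illustrated with a minimal sketch: flow matching learns a velocity field that transports Gaussian noise to the target latent distribution, and sampling integrates that field as an ODE. The shapes, function names, and toy velocity field below are illustrative assumptions, not the paper's actual model or API; a trained UniLat3D network would replace `velocity_field`, and a decoder would then turn the sampled UniLat into 3D Gaussians or a mesh.

```python
import numpy as np

# Assumed shape for the dense low-resolution UniLat latent:
# (X, Y, Z, channels). Purely illustrative.
LATENT_SHAPE = (16, 16, 16, 8)

def velocity_field(x, t, cond):
    """Stand-in for the learned flow-matching network v_theta(x, t | image).
    A trained model predicts the velocity transporting noise toward the
    UniLat distribution; here we use a toy linear pull toward `cond`."""
    return cond - x  # toy dynamics only

def sample_unilat(cond, steps=20, seed=0):
    """Single-stage sampling: integrate dx/dt = v(x, t) from t=0
    (Gaussian noise) to t=1 (UniLat) with explicit Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(LATENT_SHAPE)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, cond)
    return x

# Usage: `cond` stands in for an image-conditioning signal; decoding the
# sampled latent into Gaussians/meshes would follow in the real pipeline.
cond = np.zeros(LATENT_SHAPE)
lat = sample_unilat(cond)
print(lat.shape)
```

With the toy field pulling toward `cond = 0`, each Euler step shrinks the state by a factor `(1 - dt)`, so the sample contracts toward the condition, which mirrors how the learned field moves noise toward the data latent; the real model would be conditioned on image features rather than a fixed tensor.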