🤖 AI Summary
Existing autoregressive models for 3D shape generation suffer from significantly lower efficiency and quality compared to diffusion-based approaches.
Method: We propose OctGPT, the first octree-based multiscale autoregressive generative framework, which combines a VQVAE with octree serialization to encode 3D shapes into compact binary sequences. It introduces an octree-structured Transformer with 3D rotary positional encodings, scale-specific embeddings, and a novel token-level parallel decoding scheme.
Contribution/Results: OctGPT trains 13× faster and generates 69× faster than prior autoregressive methods, enabling high-resolution (1024³) training within days on only four NVIDIA RTX 4090 GPUs. Quantitative and qualitative evaluations demonstrate that OctGPT matches or surpasses state-of-the-art diffusion models in generation fidelity, while supporting diverse conditional inputs, including text, sketches, images, and scene-level contexts, offering a new paradigm for efficient, high-fidelity autoregressive 3D generation.
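To make the "octree serialization into compact binary sequences" idea concrete, here is a minimal sketch, not taken from the paper: it flattens each octree level into a coarse-to-fine sequence of binary occupancy tokens (1 = child cell contains geometry) and subdivides only the occupied cells. The function name and the fixed z-order child traversal are illustrative assumptions.

```python
# Hypothetical sketch of coarse-to-fine octree serialization: each level
# becomes a list of binary occupancy tokens, and only occupied cells are
# subdivided at the next level, keeping sequences far shorter than a dense grid.
import numpy as np

def serialize_octree(occupancy: np.ndarray, max_depth: int) -> list[list[int]]:
    """Return one binary token list per octree level, coarse to fine.

    `occupancy` is a cubic boolean grid with side length 2**max_depth.
    """
    n = occupancy.shape[0]
    assert occupancy.shape == (n, n, n) and n == 2 ** max_depth
    levels = []
    active = [(0, 0, 0)]  # occupied cells at the current (coarser) level
    size = n              # side length of one cell at the current level
    for _ in range(max_depth):
        size //= 2
        tokens, next_active = [], []
        for (x, y, z) in active:
            # Visit the 8 children of each occupied cell in a fixed order.
            for dx in (0, 1):
                for dy in (0, 1):
                    for dz in (0, 1):
                        cx, cy, cz = 2 * x + dx, 2 * y + dy, 2 * z + dz
                        occ = occupancy[cx * size:(cx + 1) * size,
                                        cy * size:(cy + 1) * size,
                                        cz * size:(cz + 1) * size].any()
                        tokens.append(int(occ))
                        if occ:
                            next_active.append((cx, cy, cz))
        levels.append(tokens)
        active = next_active
    return levels
```

For a sparse shape, the total token count grows with the surface rather than the volume, which is what makes the sequences compact enough for autoregressive prediction.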
📝 Abstract
Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact multiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time 13-fold and generation time 69-fold, enabling the efficient training of high-resolution 3D shapes, e.g., $1024^3$, on just four NVIDIA 4090 GPUs within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation. Our code and trained models are available at https://github.com/octree-nn/octgpt.