OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive models for 3D shape generation suffer from significantly lower efficiency and quality than diffusion-based approaches. Method: We propose OctGPT—the first octree-based, multiscale autoregressive generative framework—jointly leveraging a VQVAE and octree serialization to encode 3D shapes into compact binary sequences. It introduces an octree-structured Transformer, 3D rotary positional encodings, scale-specific embeddings, and a novel token-level parallel decoding mechanism. Contribution/Results: OctGPT achieves 13× faster training and 69× faster inference than prior autoregressive methods, enabling high-resolution (1024³) training within days on only four RTX 4090 GPUs. Quantitative and qualitative evaluations demonstrate that OctGPT matches or surpasses state-of-the-art diffusion models in generation fidelity, while supporting diverse conditional inputs—including text, sketches, images, and scene-level contexts—establishing a new paradigm for efficient, high-fidelity autoregressive 3D generation.

📝 Abstract
Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact multiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time 13-fold and generation time 69-fold, enabling the efficient training of high-resolution 3D shapes, e.g., $1024^3$, on just four NVIDIA RTX 4090 GPUs within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation. Our code and trained models are available at https://github.com/octree-nn/octgpt.
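The serialized octree representation described in the abstract can be illustrated with a minimal sketch: occupancy bits are emitted level by level, so coarse geometry comes first and each deeper level only expands nodes that were occupied at the previous one. This is an illustrative reconstruction under simplifying assumptions (a voxel-set input, one bit per child), not the paper's actual implementation; the function name `serialize_octree` is hypothetical.

```python
def serialize_octree(occupied, depth):
    """Serialize a voxelized shape into a multiscale binary sequence.

    `occupied` is a set of (x, y, z) voxel coordinates at resolution
    2**depth. Traversal is breadth-first: for every non-empty node we
    emit one occupancy bit per child (8 bits), coarse to fine, and only
    occupied children are expanded at the next level.
    """
    seq = []
    nodes = [(0, 0, 0)]  # root node covering the full grid
    for level in range(depth):
        size = 2 ** (depth - level - 1)  # child cell edge length in voxels
        next_nodes = []
        for (x, y, z) in nodes:
            for dx in (0, 1):
                for dy in (0, 1):
                    for dz in (0, 1):
                        cx, cy, cz = x + dx * size, y + dy * size, z + dz * size
                        # child is occupied if any voxel falls inside it
                        occ = any(
                            cx <= vx < cx + size and
                            cy <= vy < cy + size and
                            cz <= vz < cz + size
                            for (vx, vy, vz) in occupied)
                        seq.append(1 if occ else 0)
                        if occ:
                            next_nodes.append((cx, cy, cz))
        nodes = next_nodes
    return seq
```

Because empty regions are never expanded, the sequence length scales with the shape's surface area rather than the full $1024^3$ volume, which is what makes autoregressive prediction over such grids tractable.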
Problem

Research questions and friction points this paper is trying to address.

Improving efficiency in 3D shape generation using autoregressive models
Addressing computational challenges with long sequences in 3D modeling
Enhancing quality and scalability of 3D content creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Octree-based multiscale autoregressive 3D modeling
VQVAE for compact multiscale binary sequences
Octree-based transformers with 3D rotary positional encodings and token-parallel decoding