🤖 AI Summary
Reconstructing structured 3D building models from extremely sparse and low-quality point clouds remains challenging.
Method: We propose a procedural modeling framework built on a domain-specific language (DSL) for buildings. It bridges forward program synthesis, used to generate training data, and inverse program inference, mapping sparse point clouds to parseable, editable building programs. Multi-view images and natural language are explored as additional input modalities. A 3D convolutional encoder extracts point cloud features, and a Transformer decoder autoregressively generates the program as a token sequence.
Contribution/Results: In experiments, our approach outperforms both conventional architectural proxy reconstruction and learning-based abstraction methods, producing plausible and faithful 3D abstractions while keeping inference efficient. It points toward editable 3D content generation from low-fidelity perceptual data.
📝 Abstract
We introduce ArcPro, a novel learning framework built on architectural programs to recover structured 3D abstractions from highly sparse and low-quality point clouds. Specifically, we design a domain-specific language (DSL) to hierarchically represent building structures as a program, which can be efficiently converted into a mesh. We bridge feedforward and inverse procedural modeling by using a feedforward process for training data synthesis, allowing the network to make reverse predictions. We train an encoder-decoder on the resulting point cloud-program pairs to establish a mapping from unstructured point clouds to architectural programs, where a 3D convolutional encoder extracts point cloud features and a transformer decoder autoregressively predicts the programs in a tokenized form. Inference with our method is highly efficient and produces plausible and faithful 3D abstractions. Comprehensive experiments demonstrate that ArcPro outperforms both traditional architectural proxy reconstruction and learning-based abstraction methods. We further explore its potential to work with multi-view image and natural language inputs.
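To make the idea of a hierarchical, tokenized building program more concrete, here is a minimal Python sketch. It is not ArcPro's DSL, which the abstract does not specify: the `Layer` and `Rect` statements, the token names, and the roof-only point sampler are all hypothetical stand-ins chosen for illustration. The sketch shows a program serialized to a flat token sequence (the form an autoregressive decoder would predict), parsed back losslessly, and run "forward" to synthesize a sparse, noisy point set of the kind that could be paired with the program for training.

```python
import random
from dataclasses import dataclass


# Hypothetical toy DSL (an assumption, not the paper's grammar):
# a building is a list of layers, each extruding rectangles from z0 to z1.
@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    d: int


@dataclass(frozen=True)
class Layer:
    z0: int
    z1: int
    rects: tuple  # tuple of Rect


def to_tokens(program):
    """Flatten the hierarchy into one token sequence."""
    tokens = ["<bos>"]
    for layer in program:
        tokens += ["LAYER", layer.z0, layer.z1]
        for r in layer.rects:
            tokens += ["RECT", r.x, r.y, r.w, r.d]
    tokens.append("<eos>")
    return tokens


def from_tokens(tokens):
    """Parse a token sequence back into a program (inverse of to_tokens)."""
    if tokens[0] != "<bos>" or tokens[-1] != "<eos>":
        raise ValueError("missing sequence delimiters")
    body, layers, i = tokens[1:-1], [], 0
    while i < len(body):
        if body[i] == "LAYER":
            layers.append([body[i + 1], body[i + 2], []])
            i += 3
        elif body[i] == "RECT" and layers:
            layers[-1][2].append(Rect(*body[i + 1:i + 5]))
            i += 5
        else:
            raise ValueError(f"unexpected token {body[i]!r}")
    return [Layer(z0, z1, tuple(rects)) for z0, z1, rects in layers]


def sample_roof_points(program, n, noise, rng):
    """Forward direction: draw n noisy points from the top face of each rectangle."""
    faces = [(layer.z1, r) for layer in program for r in layer.rects]
    points = []
    for _ in range(n):
        z, r = rng.choice(faces)
        points.append((
            r.x + rng.random() * r.w + rng.gauss(0, noise),
            r.y + rng.random() * r.d + rng.gauss(0, noise),
            z + rng.gauss(0, noise),
        ))
    return points


# One synthetic training pair: (sparse noisy points, program tokens).
program = [
    Layer(0, 10, (Rect(0, 0, 20, 12),)),
    Layer(10, 14, (Rect(2, 2, 8, 8), Rect(12, 2, 6, 8))),
]
tokens = to_tokens(program)
points = sample_roof_points(program, 32, 0.3, random.Random(0))
```

Because `from_tokens(to_tokens(program))` recovers the original program, any sequence a decoder emits in this format can be parsed and edited as a structured object rather than as raw geometry. A real pipeline would also sample walls, convert the program to a mesh, and quantize continuous parameters into a fixed vocabulary; those steps are omitted here.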