🤖 AI Summary
Existing feed-forward 3D Gaussian splatting methods struggle to balance primitive redundancy against rendering quality and lack effective spatial modeling mechanisms. This work proposes a Transformer-based feed-forward Gaussian splatting architecture that introduces a novel Z-order spatial sorting strategy to transform unordered Gaussian primitives into a structurally coherent sequence. By integrating sparse attention mechanisms, the model efficiently captures long-range contextual relationships. The approach predicts high-quality Gaussian attributes in a single forward pass while adaptively suppressing redundant primitives, significantly reducing the number of Gaussians without compromising critical geometric details. Consequently, it achieves efficient and high-fidelity novel view synthesis, outperforming current feed-forward methods.
📝 Abstract
Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative per-scene optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods predict Gaussian attributes directly from images, but they often produce redundant Gaussian primitives and fall short in rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. We further use this Z-order strategy to adaptively suppress redundant primitives while preserving critical structural details. Together, these components allow the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
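The abstract does not spell out how the Z-order sequence is built, so the following is only a minimal sketch of the general idea, assuming the common Morton-code construction: Gaussian centers are quantized to an integer grid, the bits of the three coordinates are interleaved into one key, and the primitives are sorted by that key. The function names (`morton_codes`, `z_order_sort`, `window_partition`), the 10-bit quantization, and the fixed window size are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def morton_codes(xyz: np.ndarray, bits: int = 10) -> np.ndarray:
    """Return one Morton (Z-order) key per 3D point.

    Assumption: points are quantized to a 2**bits grid spanning their
    bounding box; the paper may use a different discretization.
    """
    lo = xyz.min(axis=0)
    hi = xyz.max(axis=0)
    scale = (2**bits - 1) / np.maximum(hi - lo, 1e-12)
    q = np.clip((xyz - lo) * scale, 0, 2**bits - 1).astype(np.uint64)

    codes = np.zeros(len(xyz), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            # Take bit b of this axis and place it at position 3*b + axis.
            bit = (q[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)
    return codes


def z_order_sort(xyz: np.ndarray, bits: int = 10) -> np.ndarray:
    """Return the permutation that orders points along the Z-order curve."""
    return np.argsort(morton_codes(xyz, bits), kind="stable")


def window_partition(order: np.ndarray, window: int) -> list:
    """Split the sorted index sequence into contiguous windows.

    Hypothetical use: attention restricted to each window is one simple
    form of sparse attention over mostly nearby primitives.
    """
    return [order[i:i + window] for i in range(0, len(order), window)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(-1.0, 1.0, size=(1000, 3))  # stand-in Gaussian centers
    order = z_order_sort(centers)
    groups = window_partition(order, window=64)
    print(len(groups), groups[0][:5])
```

Points that are adjacent in the sorted sequence tend to be close in space, although the curve does have occasional jumps, which is why a sequence model over this ordering can attend locally and still see mostly spatial neighbors.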