🤖 AI Summary
Existing 3D scene generation methods suffer from high storage overhead, structural distortion, and insufficient cross-modal collaborative modeling, preventing them from simultaneously achieving photorealism, geometric fidelity, and a lightweight model. To address these challenges, we propose a cross-modal progressive 3D scene generation framework tailored for virtual reality. Our method introduces a novel hierarchical depth prior regularization mechanism to enforce geometric consistency; designs a structured context-guided hash grid compression scheme for efficient scene representation; and integrates 3D Gaussian splatting with incremental point cloud reconstruction, unified via cross-modal feature alignment to jointly process text and image inputs. Experiments demonstrate that our approach significantly outperforms baselines across diverse scenes: generated scenes exhibit coherent structure and high geometric accuracy, while the model size is reduced by over 60%, substantially lowering memory and storage requirements.
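To make the hierarchical depth prior regularization concrete, here is a minimal NumPy sketch of a multi-level depth loss. The function names (`downsample`, `hierarchical_depth_loss`), the pyramid levels, and the smoothness weight `lam` are illustrative assumptions, not the paper's actual formulation: it simply combines a depth-accuracy term against a prior with a gradient-based smoothness penalty at several scales, which is the kind of multi-level constraint the summary describes.

```python
import numpy as np

def downsample(depth, factor):
    """Average-pool a depth map by an integer factor (shape assumed divisible)."""
    h, w = depth.shape
    return depth.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def hierarchical_depth_loss(rendered, prior, levels=(1, 2, 4), lam=0.1):
    """Illustrative multi-level depth loss: accuracy vs. a depth prior,
    plus a smoothness penalty on depth gradients, at each pyramid level."""
    loss = 0.0
    for f in levels:
        r, p = downsample(rendered, f), downsample(prior, f)
        loss += np.abs(r - p).mean()  # depth-accuracy term at this scale
        # Smoothness term: penalize large depth discontinuities.
        loss += lam * (np.abs(np.diff(r, axis=0)).mean()
                       + np.abs(np.diff(r, axis=1)).mean())
    return loss
```

When the rendered depth matches a smooth prior exactly, both terms vanish at every level; deviations or sharp depth jumps increase the loss.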
📝 Abstract
With the widespread use of virtual reality applications, 3D scene generation has become a new and challenging research frontier. 3D scenes are highly complex, and generation methods must ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularization, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting framework for cross-modal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a cross-modal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that applies multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Finally, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which substantially reduces structural redundancy and storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.
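The idea of using a structured hash grid to organize unordered anchor attributes can be sketched as a spatial hash: each anchor position is snapped to a grid cell, and the cell is hashed into a fixed-size table so nearby anchors share context slots. The cell size, table size, and prime constants below are illustrative assumptions (the primes are those commonly used in Instant-NGP-style spatial hashing), not the paper's actual compression scheme.

```python
import math

def hash_grid_index(pos, cell_size=0.5, table_size=2**14):
    """Map a 3D anchor position to a slot in a fixed-size hash table.

    Anchors falling in the same grid cell map to the same slot, giving
    unorganized anchors a structured context for entropy coding.
    """
    primes = (1, 2654435761, 805459861)  # common spatial-hash primes
    cell = [int(math.floor(c / cell_size)) for c in pos]
    h = 0
    for c, p in zip(cell, primes):
        h ^= c * p  # XOR-combine per-axis hashes
    return h % table_size
```

Two anchors inside the same 0.5-unit cell collide deterministically, so their attributes can be predicted from a shared hash-grid feature rather than stored independently.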