Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular indoor Semantic Scene Completion (SSC) suffers from 3D reconstruction ambiguities caused by depth uncertainty, scale misalignment, and severe occlusion, which lead to deformed or missing structures. To address these challenges, the authors propose GenFuSE, a framework built on virtual multiview augmentation: it synthesizes complementary views to mitigate single-image depth uncertainty; introduces a Multiview Fusion Adaptor (MVFA) for cross-view 3D feature alignment and aggregation; and explicitly identifies and studies the novelty-consistency trade-off inherent in generative completion. Evaluated on NYUv2, GenFuSE improves scene completion IoU by up to 2.8% and semantic scene completion IoU by up to 4.9% when integrated with existing SSC networks. With its plug-and-play design, GenFuSE offers a general, robust, and generative solution for monocular SSC.

📝 Abstract
Monocular Indoor Semantic Scene Completion (SSC) aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene, inferring spatial layout and object categories from 2D image cues. The challenge of this task arises from the depth, scale, and shape ambiguities that emerge when transforming a 2D image into 3D space, particularly within the complex and often heavily occluded environments of indoor scenes. Current SSC methods often struggle with these ambiguities, resulting in distorted or missing object representations. To overcome these limitations, we introduce an innovative approach that leverages novel view synthesis and multiview fusion. Specifically, we demonstrate how virtual cameras can be placed around the scene to emulate multiview inputs that enhance contextual scene information. We also introduce a Multiview Fusion Adaptor (MVFA) to effectively combine the multiview 3D scene predictions into a unified 3D semantic occupancy map. Finally, we identify and study the inherent limitation of generative techniques when applied to SSC, specifically the Novelty-Consistency tradeoff. Our system, GenFuSE, demonstrates IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset. This work introduces GenFuSE as a standard framework for advancing monocular SSC with synthesized inputs.
Problem

Research questions and friction points this paper is trying to address.

Reconstruct 3D semantic occupancy from single RGB image.
Address depth, scale, and shape ambiguities in 2D-to-3D transformation.
Enhance SSC accuracy using virtual multiviews and fusion techniques.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Places virtual cameras around the scene to synthesize multiview inputs.
Introduces a Multiview Fusion Adaptor (MVFA) to merge per-view 3D predictions into a unified semantic occupancy map.
Improves IoU on NYUv2 by up to 2.8% (scene completion) and 4.9% (semantic scene completion).
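The paper's MVFA is a learned adapter whose internals are not detailed in this summary. As a conceptual illustration only, the core idea of combining several per-view semantic occupancy predictions into one map can be sketched as a confidence-weighted fusion of per-voxel class logits followed by an argmax; all shapes, weights, and the averaging scheme below are hypothetical stand-ins, not the paper's method:

```python
import numpy as np

def fuse_views(view_logits, view_conf):
    """Fuse per-view semantic occupancy logits into one label map.

    view_logits: list of arrays, each (C, D, H, W) class logits per voxel
    view_conf:   list of per-view confidence weights (e.g. lower for
                 synthesized virtual views than for the real input view)
    """
    w = np.asarray(view_conf, dtype=np.float64)
    w = w / w.sum()  # normalize weights to sum to 1
    fused = sum(wi * v for wi, v in zip(w, view_logits))
    return fused.argmax(axis=0)  # (D, H, W): one semantic class per voxel

# Toy example: 3 views, 12 semantic classes, a 4x4x4 voxel grid.
rng = np.random.default_rng(0)
views = [rng.normal(size=(12, 4, 4, 4)) for _ in range(3)]
labels = fuse_views(views, [1.0, 0.5, 0.5])  # real view weighted highest
print(labels.shape)
```

Down-weighting the virtual views reflects the novelty-consistency trade-off the paper identifies: synthesized views add context but may hallucinate content inconsistent with the real image.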