YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing indoor 3D semantic occupancy prediction methods rely heavily on precise camera parameters and large-scale, pixel-accurate 3D annotations, making them costly and impractical to scale. Method: This paper introduces the first fully self-supervised framework trained exclusively on unlabeled, camera-parameter-free indoor internet videos (e.g., YouTube). It distills semantic knowledge from a 2D vision foundation model (VFM) into 3D space via superpixel-guided aggregation, enabling end-to-end learning without geometric priors. Contribution/Results: The authors release YouTube-Occ, the first large-scale self-supervised indoor occupancy dataset derived from web videos. The method achieves state-of-the-art zero-shot transfer performance on NYUv2 and OccScanNet, demonstrating for the first time that high-fidelity 3D semantic occupancy learning is feasible using only uncurated online video, thereby drastically reducing data acquisition and annotation overhead.

📝 Abstract
3D semantic occupancy prediction has traditionally been considered to require precise geometric relationships for effective training. However, in complex indoor environments, large-scale, widespread data collection and the need for fine-grained annotations become impractical due to the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without any prior knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Building on this web dataset, we establish a fully self-supervised model that leverages accessible 2D prior knowledge to reach powerful 3D indoor perception. Specifically, we harness the advantages of powerful vision foundation models, distilling 2D region-level knowledge into the occupancy network by grouping similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet).
Problem

Research questions and friction points this paper is trying to address.

Learning 3D semantic occupancy from YouTube videos
Overcoming need for precise geometric relationships
Self-supervised 3D perception without camera parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses YouTube videos for 3D occupancy training
Self-supervised model with 2D prior knowledge
Distills 2D region-level knowledge via superpixels
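The superpixel-guided distillation in the last bullet can be sketched as a region-pooling step: per-pixel features from a 2D vision foundation model are averaged within each superpixel, and the region means become the targets the occupancy network is distilled toward. The function name, shapes, and toy inputs below are illustrative assumptions, not the authors' implementation; in practice the superpixel labels would come from a method such as SLIC.

```python
import numpy as np

def superpixel_pool(features, labels):
    """Average per-pixel features within each superpixel region.

    features: (H, W, C) per-pixel embeddings from a 2D VFM (hypothetical input).
    labels:   (H, W) integer superpixel ids (e.g. produced by SLIC).
    Returns:  (K, C) one pooled feature per superpixel, and an
              (H, W, C) map where each pixel carries its region mean,
              usable as a region-level distillation target.
    """
    H, W, C = features.shape
    flat_feat = features.reshape(-1, C)
    flat_lab = labels.reshape(-1)
    ids = np.unique(flat_lab)
    pooled = np.zeros((ids.size, C), dtype=features.dtype)
    for k, sp in enumerate(ids):
        # Mean feature over all pixels belonging to superpixel `sp`.
        pooled[k] = flat_feat[flat_lab == sp].mean(axis=0)
    # Broadcast each region mean back to its pixels.
    index = np.searchsorted(ids, flat_lab)
    target = pooled[index].reshape(H, W, C)
    return pooled, target

# Toy example: a 4x4 image split into two superpixels (left/right halves)
# with constant 3-dim features inside each region.
feat = np.zeros((4, 4, 3))
feat[:, :2] = [1.0, 0.0, 0.0]
feat[:, 2:] = [0.0, 1.0, 0.0]
labs = np.zeros((4, 4), dtype=int)
labs[:, 2:] = 1
pooled, target = superpixel_pool(feat, labs)
```

Because the toy features are constant within each region, the pooled means equal the original per-pixel features; with real VFM features the pooling smooths noisy per-pixel predictions into more stable region-level targets.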
Haoming Chen
East China Normal University
Computer vision · Deep learning · Human pose estimation · 3D scene understanding
Lichen Yuan
East China Normal University
TianFang Sun
East China Normal University
Jingyu Gong
Shanghai Jiao Tong University
3D Computer Vision
Xin Tan
East China Normal University
Zhizhong Zhang
Associate Researcher, East China Normal University
Computer Vision
Yuan Xie
East China Normal University