SS3D: End2End Self-Supervised 3D from Web Videos

📅 2026-04-24

🤖 AI Summary
This work addresses two obstacles to end-to-end self-supervised monocular 3D reconstruction from unconstrained internet videos: weak multi-view constraints and high data heterogeneity. The authors propose a Structure-from-Motion (SfM)-based self-supervised pretraining framework that jointly predicts depth, ego-motion, and camera intrinsics in a single forward pass. To stabilize joint learning they use an intrinsics-first two-stage training schedule. To scale to web video they introduce a multi-view signal proxy (MVS) for data filtering and curriculum sampling, and distill expert models into a single student. Pretraining runs on YouTube-8M (about 100M frames after filtering). The abstract reports strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines, and the pretrained checkpoint and code are released.
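
The filtering and curriculum-sampling idea can be illustrated with a small sketch. The summary does not say how the proxy score is computed or how the curriculum is scheduled, so everything below is an assumption for illustration: the per-clip `scores`, the threshold `min_score`, the `sharpness` parameter, and the softmax-style schedule are all hypothetical and are not taken from the paper.

```python
import numpy as np

def filter_clips(scores, min_score):
    """Return indices of clips whose (hypothetical) proxy score is >= min_score."""
    scores = np.asarray(scores, dtype=np.float64)
    return np.flatnonzero(scores >= min_score)

def curriculum_probs(scores, progress, sharpness=5.0):
    """Sampling probabilities over clips at training progress in [0, 1].

    Assumed schedule: early in training (progress near 0) high-score clips
    are favored; at progress = 1 sampling is uniform over the kept clips.
    """
    scores = np.asarray(scores, dtype=np.float64)
    logits = sharpness * (1.0 - progress) * scores
    logits -= logits.max()            # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Usage: filter first, then draw a batch of clip indices.
scores = np.array([0.1, 0.5, 0.9])    # made-up proxy scores
keep = filter_clips(scores, min_score=0.4)
probs = curriculum_probs(scores[keep], progress=0.0)
rng = np.random.default_rng(0)
batch = rng.choice(keep, size=4, p=probs)
```

Separating the hard filter from the soft sampling weights keeps clips with almost no multi-view signal out of training entirely, while still letting harder clips enter gradually as training progresses.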

📝 Abstract
We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
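
To make concrete why depth, ego-motion, and intrinsics can supervise each other, here is a minimal NumPy sketch of the standard SfM view-synthesis reprojection. It is a generic illustration, not the authors' implementation: the function name `reproject`, the pinhole camera model, and the pose convention (target-camera to source-camera) are assumptions. In a typical self-supervised setup, a photometric loss compares the target frame with the source frame sampled at the returned coordinates.

```python
import numpy as np

def reproject(depth, K, R, t):
    """Map every target pixel to its location in the source view.

    depth: (H, W) target-view depth.
    K:     (3, 3) pinhole intrinsics (assumed shared by both views).
    R, t:  rotation (3, 3) and translation (3,) taking target-camera
           coordinates to source-camera coordinates.
    Returns an (H, W, 2) array of (u, v) source pixel coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                    # back-project: K^-1 p
    pts = rays * depth[..., None]                      # 3D points, target frame
    pts_src = pts @ R.T + t                            # move to source frame
    proj = pts_src @ K.T                               # project with K
    return proj[..., :2] / proj[..., 2:3]              # perspective divide

# Usage: constant depth of 2 and a 0.1 shift along x with fx = 100.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
depth = np.full((48, 64), 2.0)
uv = reproject(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

Because the warp depends on all three predicted quantities at once, an error in any of them shows up as a photometric mismatch, which is what lets unlabeled video serve as the training signal.
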
Problem

Research questions and friction points this paper is trying to address.

self-supervised 3D
monocular video
structure-from-motion
web-scale learning
multi-view observability
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised 3D
monocular video
structure-from-motion (SfM)
multi-view signal proxy
end-to-end depth estimation