Monocular Per-Object Distance Estimation with Masked Object Modeling

📅 2024-01-06
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the problem of per-object depth estimation from monocular images—a critical capability for surveillance and autonomous driving. We propose Masked Object Modeling (MOM), a novel paradigm that departs from conventional methods relying on geometric priors or ground-truth depth supervision. MOM pioneers the adaptation of masked language modeling to monocular depth estimation, framing instance masks as self-supervised reconstruction units to enable label-free, object-level distance learning. Key methodological contributions include object-aware positional encoding, cross-scale feature alignment, and a differentiable depth decoder. Built upon a Transformer architecture, MOM integrates multi-scale features and optimizes via self-supervised reconstruction loss. Evaluated on ScanNet and NYUv2, MOM achieves state-of-the-art performance, reducing per-object relative depth error by 18.7% over prior art. It operates solely on RGB input and supports real-time inference in open-world scenes.
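The paper itself ships no code, but the core objective described above — hiding whole object regions given their instance masks, reconstructing the image, and scoring the reconstruction only on the hidden objects — can be sketched in a few lines. Everything below is illustrative: the function name `masked_object_loss`, the zero-fill masking, and the plain MSE loss are assumptions; in MOM the `reconstruct` step is a Transformer with multi-scale features, stubbed here as a caller-supplied function.

```python
import numpy as np

def masked_object_loss(image, instance_masks, reconstruct, mask_ratio=0.5, rng=None):
    """Illustrative Masked Object Modeling objective (not the paper's code).

    image:          HxWxC float array
    instance_masks: list of HxW boolean arrays, one per object
    reconstruct:    callable mapping a masked image to a reconstruction
    mask_ratio:     probability that each object is hidden
    """
    if rng is None:
        rng = np.random.default_rng(0)
    masked = image.copy()
    # Randomly select whole objects (not patches) as the masking units.
    hidden = [m for m in instance_masks if rng.random() < mask_ratio]
    for m in hidden:
        masked[m] = 0.0  # zero-fill the object's pixels (one possible masking scheme)
    if not hidden:
        return 0.0
    recon = reconstruct(masked)
    region = np.any(hidden, axis=0)  # union of the hidden objects' masks
    # Self-supervised signal: reconstruction error over masked objects only.
    return float(np.mean((recon[region] - image[region]) ** 2))
```

A perfect reconstructor drives the loss to zero, while an identity pass-through of the masked input is penalized exactly on the hidden pixels — which is what forces the model to infer object appearance (and, per the paper's claim, object-level geometry) from context.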

Problem

Research questions and friction points this paper is trying to address.

Estimating per-object distance from a single monocular image in autonomous-driving scenes.
Learning individual object distances without ground-truth depth supervision.
Keeping distance estimates robust for occluded or poorly detected objects.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Object Modeling (MOM): instance masks used as self-supervised reconstruction units
Single unified training stage instead of multi-stage pipelines
Enhanced zero-shot and few-shot capabilities in open-world scenes