🤖 AI Summary
This work addresses per-object depth estimation from monocular images, a capability relevant to surveillance and autonomous driving. We propose Masked Object Modeling (MOM), a paradigm that departs from conventional methods relying on geometric priors or ground-truth depth supervision. MOM adapts the masked-modeling idea from language pretraining to monocular depth estimation, treating instance masks as self-supervised reconstruction units to enable label-free, object-level distance learning. Key methodological contributions include object-aware positional encoding, cross-scale feature alignment, and a differentiable depth decoder. Built on a Transformer architecture, MOM integrates multi-scale features and is optimized with a self-supervised reconstruction loss. Evaluated on ScanNet and NYUv2, MOM achieves state-of-the-art performance, reducing per-object relative depth error by 18.7% over prior methods. It operates on RGB input alone and supports real-time inference in open-world scenes.
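The core training signal described above can be illustrated with a minimal sketch: an instance mask designates an object region, that region is blanked out of the input, and a reconstruction loss is computed only over the masked object. All names here (`toy_reconstructor`, the array shapes) are hypothetical stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy RGB image (H x W x 3) and a binary instance mask for one object.
H, W = 8, 8
image = rng.random((H, W, 3))
instance_mask = np.zeros((H, W), dtype=bool)
instance_mask[2:6, 3:7] = True  # hypothetical object region

# Masked-object input: zero out the object's pixels, as in masked modeling.
masked_input = image.copy()
masked_input[instance_mask] = 0.0

def toy_reconstructor(x):
    # Stand-in for the Transformer encoder-decoder:
    # here it just predicts the global mean everywhere.
    return np.full_like(x, x.mean())

reconstruction = toy_reconstructor(masked_input)

# Self-supervised reconstruction loss restricted to the masked object,
# mirroring the use of instance masks as reconstruction units.
loss = float(np.mean((reconstruction[instance_mask] - image[instance_mask]) ** 2))
print(loss >= 0.0)
```

In MOM the reconstructor would be the actual Transformer with object-aware positional encoding, and the reconstruction target would feed the depth decoder; this sketch only shows where the label-free loss comes from.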