AI Summary
Supervised monocular depth estimation from event cameras is hindered by the scarcity of dense, ground-truth depth annotations. Method: This paper proposes a cross-modal distillation framework that leverages vision foundation models (e.g., Depth Anything v2) to generate high-fidelity pseudo-depth labels for temporally and spatially aligned RGB-event data, coupled with a lightweight recurrent neural network architecture explicitly designed to capture the spatiotemporal dynamics of event streams. Contribution/Results: To our knowledge, this is the first work enabling effective transfer of large-scale pre-trained vision models to the event domain without any real depth supervision. Evaluated on both synthetic and real-world benchmarks, the method achieves performance on par with fully supervised state-of-the-art approaches, drastically reducing annotation overhead and advancing event-driven depth estimation toward practical deployment.
Abstract
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm that leverages a Vision Foundation Model (VFM) to generate dense proxy labels. Our strategy requires only an event stream spatially aligned with RGB frames, a simple setup readily available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs to infer depth from monocular event cameras, either by using a vanilla model such as Depth Anything v2 (DAv2) or by deriving from it a novel recurrent architecture. We evaluate our approach on synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves performance competitive with fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
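The distillation idea can be sketched in a few lines. The snippet below is a toy illustration under loud assumptions: `vfm_pseudo_depth` is a hypothetical stand-in for a frozen VFM such as DAv2 (here just image brightness, not the real model), and the "event student" is a single learnable gain over a dense event-count map rather than the paper's recurrent architecture. It only shows the training pattern: the VFM labels the aligned RGB frame, and the event network regresses those proxy labels with no ground-truth depth involved.

```python
import numpy as np

def vfm_pseudo_depth(rgb):
    """Hypothetical stand-in for a frozen VFM (e.g. Depth Anything v2):
    maps an RGB frame (H, W, 3) to a dense pseudo-depth map (H, W).
    Here brightness is used as a crude toy proxy for depth."""
    return rgb.mean(axis=-1)

def train_student(event_maps, rgb_frames, lr=0.1, epochs=50):
    """Toy 'event student': one scalar gain w applied to an event-count
    map, fitted by gradient descent on an L2 loss against the VFM's
    pseudo-depth labels (cross-modal distillation, no real depth)."""
    w = 0.0
    for _ in range(epochs):
        for ev, rgb in zip(event_maps, rgb_frames):
            target = vfm_pseudo_depth(rgb)          # dense proxy label
            pred = w * ev                            # student prediction
            grad = 2.0 * ((pred - target) * ev).mean()  # dL/dw for mean L2
            w -= lr * grad
    return w

rng = np.random.default_rng(0)
rgb_frames = rng.random((4, 8, 8, 3))
# Events assumed spatially aligned with the RGB frames; here they are
# correlated with brightness by construction so the toy problem is solvable.
event_maps = np.stack([f.mean(axis=-1) * 2.0 for f in rgb_frames])
w = train_student(event_maps, rgb_frames)  # converges near 0.5 by design
```

In the real method the scalar gain would be a full network (recurrent, to exploit the temporal structure of event streams), and the pseudo-labels would come from the actual pre-trained VFM, but the supervision signal flows exactly as above.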