Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation

πŸ“… 2025-09-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Problem: Supervised monocular depth estimation from event cameras is hindered by the scarcity of dense, ground-truth depth annotations. Method: This paper proposes a cross-modal distillation framework that leverages vision foundation models (e.g., Depth Anything v2) to generate high-fidelity pseudo-depth labels for temporally and spatially aligned RGB-event data, coupled with a lightweight recurrent neural network architecture explicitly designed to capture the spatiotemporal dynamics of event streams. Contribution/Results: The authors present this as the first work enabling effective transfer of large-scale pre-trained vision models to the event domain without any real depth supervision. Evaluated on both synthetic and real-world benchmarks, the method achieves performance on par with fully supervised state-of-the-art approaches, drastically reducing annotation overhead and advancing event-driven depth estimation toward practical deployment.
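
To make the described pipeline concrete, below is a minimal sketch of one distillation step, assuming PyTorch. The teacher stands in for a frozen Depth Anything v2 model applied to the aligned RGB frame, the student for the event-based network; the module definitions and the scale-invariant loss are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

def si_log_loss(pred, target, eps=1e-6):
    """Scale-invariant log loss, a common choice when the proxy depth
    produced by a VFM is only meaningful up to scale."""
    d = torch.log(pred.clamp_min(eps)) - torch.log(target.clamp_min(eps))
    return (d ** 2).mean() - 0.5 * d.mean() ** 2

# Frozen teacher: placeholder for the pretrained VFM (Depth Anything v2 in the
# paper) applied to the RGB frame; a tiny conv keeps the sketch self-contained.
teacher = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Softplus()).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Toy student: stands in for the event-based depth network being trained.
student = nn.Sequential(nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

# One distillation step on a dummy spatially aligned RGB / event-voxel pair.
rgb = torch.rand(2, 3, 64, 64)        # RGB frame -> teacher input
voxel = torch.randn(2, 5, 64, 64)     # event voxel grid -> student input
with torch.no_grad():
    pseudo_depth = teacher(rgb)       # dense proxy label, no real ground truth
loss = si_log_loss(student(voxel), pseudo_depth)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the teacher only needs the RGB frame and the student only needs the aligned event stream, no real depth ever enters the loop; that is the core of the cross-modal distillation claim.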

πŸ“ Abstract
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels by leveraging a Vision Foundation Model (VFM). Our strategy requires only an event stream spatially aligned with RGB frames, a simple setup that is even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either using a vanilla one such as Depth Anything v2 (DAv2) or deriving from it a novel recurrent architecture, to infer depth from monocular event cameras. We evaluate our approach on synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
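
Since the abstract emphasizes feeding a sparse, high-temporal-resolution event stream to a network, the snippet below shows one common way to do this: accumulating events into a voxel grid with bilinear weighting along time. This is a standard representation in the event-vision literature, given here only as an assumption; the summary does not specify the exact encoding the paper uses.

```python
import torch

def events_to_voxel_grid(x, y, t, p, bins=5, height=260, width=346):
    """Accumulate an event stream (x, y, timestamp, polarity in {0, 1}) into a
    (bins, H, W) voxel grid with bilinear weighting along the time axis.
    Default resolution matches a DAVIS346 sensor; adjust as needed."""
    voxel = torch.zeros(bins, height, width)
    if t.numel() == 0:
        return voxel
    # Normalize timestamps to [0, bins - 1].
    t_norm = (t - t[0]) / max((t[-1] - t[0]).item(), 1e-9) * (bins - 1)
    pol = p.float() * 2 - 1                       # map {0, 1} -> {-1, +1}
    left = t_norm.floor().long().clamp(0, bins - 1)
    right = (left + 1).clamp(0, bins - 1)
    w_right = t_norm - left.float()
    flat = voxel.view(bins, -1)
    idx = y.long() * width + x.long()
    flat.index_put_((left, idx), pol * (1 - w_right), accumulate=True)
    flat.index_put_((right, idx), pol * w_right, accumulate=True)
    return voxel

# Toy usage with random events on a 64x64 sensor.
n = 1000
x = torch.randint(0, 64, (n,)); y = torch.randint(0, 64, (n,))
t = torch.sort(torch.rand(n)).values; p = torch.randint(0, 2, (n,))
grid = events_to_voxel_grid(x, y, t, p, bins=5, height=64, width=64)  # (5, 64, 64)
```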
Problem

Research questions and friction points this paper is trying to address.

Monocular depth estimation from sparse event camera data
Lack of large datasets with dense depth annotations
How to generate dense proxy depth labels via cross-modal distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal distillation for depth estimation
Leveraging the robustness of large-scale Vision Foundation Models
Novel recurrent architecture derived from Depth Anything v2 (sketched below)
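
The recurrent variant is only named in this summary, so the following is a generic pattern rather than the paper's architecture: a frozen per-window encoder (standing in for a VFM backbone such as DAv2), a ConvGRU cell carrying state across successive event windows, and a small depth head. All class and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGRUCell(nn.Module):
    """Standard ConvGRU update used to carry state across event windows."""
    def __init__(self, ch):
        super().__init__()
        self.zr = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        self.h = nn.Conv2d(ch * 2, ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

class RecurrentDepth(nn.Module):
    """Generic pattern: frozen per-window encoder + ConvGRU memory + depth head.
    The encoder must map each voxel grid to (B, ch, H', W') feature maps."""
    def __init__(self, encoder, ch=64):
        super().__init__()
        self.encoder = encoder                    # e.g. a pretrained backbone, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.gru = ConvGRUCell(ch)
        self.head = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, voxel_seq):                 # (T, B, bins, H, W)
        h = None
        for voxel in voxel_seq:
            feat = self.encoder(voxel)
            h = feat if h is None else self.gru(feat, h)
        return F.softplus(self.head(h))           # positive depth from the last window

# Toy usage: a single conv stands in for the (frozen) VFM encoder in this sketch.
model = RecurrentDepth(nn.Conv2d(5, 64, 3, padding=1), ch=64)
depth = model(torch.randn(4, 2, 5, 64, 64))       # 4 event windows -> (2, 1, 64, 64)
```

Carrying a hidden state this way lets a single-frame depth backbone exploit the temporal continuity of the event stream without retraining the backbone itself, which matches the summary's emphasis on a lightweight recurrent adaptation.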
πŸ”Ž Similar Papers
No similar papers found.