AI Summary
To address the failure of conventional frame-based stereo depth estimation in dynamic scenes, this paper proposes the first end-to-end method for dense depth estimation directly from asynchronous binocular spike streams. The approach follows a fusion-refinement paradigm: a recurrent spiking neural network (RSNN) jointly fuses spatiotemporal spike information across the two views and iteratively refines the depth prediction. Key contributions include: (1) the first end-to-end stereo matching framework explicitly designed for spike data; (2) the first large-scale synthetic and real-world stereo spike datasets with dense ground-truth depth annotations; and (3) significantly improved robustness in challenging regions, e.g., textureless surfaces and high-illumination conditions. Experiments demonstrate that the method consistently outperforms existing approaches on both benchmark datasets. Remarkably, it retains over 90% of its full-data accuracy using only 20% of the training data. The code and dataset will be made publicly available.
Abstract
Conventional frame-based cameras often struggle with stereo depth estimation in rapidly changing scenes. In contrast, bio-inspired spike cameras emit asynchronous events at microsecond-level resolution, providing an alternative sensing modality. However, existing methods lack stereo algorithms and benchmarks tailored to spike data. To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams. The model fuses raw spike streams from two viewpoints and iteratively refines its depth estimates through a recurrent spiking neural network (RSNN) update module. To benchmark our approach, we introduce a large-scale synthetic spike stream dataset and a real-world stereo spike dataset with dense depth annotations. SpikeStereoNet outperforms existing methods on both datasets by leveraging spike streams' ability to capture subtle edges and intensity shifts in challenging regions such as textureless surfaces and extreme lighting conditions. Furthermore, our framework exhibits strong data efficiency, maintaining high accuracy even with substantially reduced training data. The source code and datasets will be publicly available.
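The fusion-refinement idea, where a recurrent spiking update module iteratively nudges a depth map toward a proposal derived from fused binocular features, can be sketched in miniature. The sketch below is an illustrative assumption, not the paper's actual architecture: the cost volume shape, the leaky integrate-and-fire (LIF) dynamics, and the gating rule are all hypothetical placeholders chosen to show the control flow of such an iterative RSNN refinement loop.

```python
import numpy as np

def lif_step(inp, mem, tau=0.5, v_th=1.0):
    """One leaky integrate-and-fire update: decay, integrate, spike, hard reset."""
    mem = tau * mem + inp
    spikes = (mem >= v_th).astype(inp.dtype)
    mem = mem * (1.0 - spikes)  # reset membrane potential where a spike fired
    return spikes, mem

def refine_depth(cost, depth, n_iters=4, step=0.5):
    """Iteratively refine a depth map (H, W) given a matching cost volume (D, H, W).

    Hypothetical stand-in for the RSNN update module: a soft-argmin over the
    cost volume yields a depth proposal, and LIF spikes gate how much of the
    residual is applied at each iteration.
    """
    mem = np.zeros_like(depth)
    levels = np.arange(cost.shape[0], dtype=depth.dtype)[:, None, None]
    for _ in range(n_iters):
        # Fuse: softmax over negated costs -> per-pixel depth expectation.
        w = np.exp(-cost)
        prob = w / w.sum(axis=0, keepdims=True)
        proposal = (prob * levels).sum(axis=0)
        # Recurrent spiking update: spikes gate the correction toward the proposal.
        spikes, mem = lif_step(np.abs(proposal - depth), mem)
        depth = depth + step * spikes * (proposal - depth)
    return depth
```

Each iteration reuses the same update module with persistent membrane state, mirroring how a recurrent refinement network shares weights across refinement steps.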