🤖 AI Summary
This work addresses the challenge of cross-modal generalization between RGB images and event streams in learning-based event camera methods. The authors propose projecting event data into the latent space of a frozen, pretrained RGB foundation model and employing low-rank adaptation (LoRA) to construct a shared manifold that aligns the two modalities. This approach enables direct zero-shot application of complex image-trained decoders (e.g., MASt3R) to raw event data without task-specific training, leveraging the geometric and semantic priors embedded in the RGB backbone. It also supports downstream tasks such as depth estimation and semantic segmentation by transferring linear heads trained on the RGB teacher. The method achieves state-of-the-art performance in wide-baseline feature matching, significantly outperforming specialized architectures.
📝 Abstract
Event cameras provide several unique advantages over standard frame-based sensors, including high temporal resolution, low latency, and robustness to extreme lighting. However, existing learning-based approaches for event processing are typically confined to narrow, task-specific silos and lack the ability to generalize across modalities. We address this gap with REALM, a cross-modal framework that learns an RGB and Event Aligned Latent Manifold by projecting event representations into the pretrained latent space of RGB foundation models. Instead of task-specific training, we leverage low-rank adaptation (LoRA) to bridge the modality gap, unlocking the geometric and semantic priors of frozen RGB backbones for asynchronous event streams. We demonstrate that REALM effectively maps events into the latent space of ViT-based foundation models. Our method allows us to perform downstream tasks like depth estimation and semantic segmentation by simply transferring linear heads trained on the RGB teacher. Most significantly, REALM enables the direct, zero-shot application of complex, frozen image-trained decoders, such as MASt3R, to raw event data. We demonstrate state-of-the-art performance in wide-baseline feature matching, significantly outperforming specialized architectures. Code and models will be made available upon acceptance.
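For readers unfamiliar with LoRA, the following is a minimal, dependency-free sketch of the general technique: a frozen weight matrix plus a trainable low-rank update. It is not the REALM implementation. The abstract does not say which layers are adapted or what rank is used, so the class `LoRALinear`, the helper `lora_param_count`, and all dimensions below are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weights, never updated
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Number of trainable values in A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Toy example (assumed sizes): a 2x3 frozen layer with a rank-1 adapter.
frozen_w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(frozen_w, rank=1)
x = [1.0, 2.0, 3.0]
out_at_init = layer.forward(x)  # equals the frozen layer's output, since B is zero
```

Because `B` is initialized to zero, the adapted layer initially reproduces the frozen layer exactly, and training only moves `A` and `B`. For an illustrative 768-by-768 projection with rank 8, `lora_param_count(768, 768, 8)` gives 12,288 trainable values against 589,824 in the full matrix, which is why the pretrained backbone can stay frozen while the adapter learns the modality-specific correction.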