🤖 AI Summary
This work addresses instance-level segmentation of moving objects in dynamic scenes by fusing single RGB images with event streams, tackling two key challenges: the ambiguity between camera motion and object motion, and the imprecise masks that arise from the texture-sparse nature of event data. We propose an implicit cross-modal masked attention mechanism for fine-grained alignment between RGB and event features, introduce explicit contrastive feature learning to enhance motion discriminability, and incorporate optical-flow-guided motion modeling that decouples mask segmentation from motion classification, enabling the method to handle an arbitrary number of independently moving objects. The method adopts a two-stage architecture, achieves state-of-the-art performance on multiple benchmarks, and supports real-time inference. The code and pre-trained models are publicly available.
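The paper's implementation is not reproduced here, but the masked-attention idea above can be sketched in a few lines: queries from the texture (image) branch attend only to event-feature locations kept by a foreground mask, so texture-sparse background regions cannot dilute the aggregated motion features. Function name, shapes, and the hard 0/1 mask are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_cross_attention(q, k, v, mask):
    """Cross-attention from image-feature queries (q) to event-feature
    keys/values (k, v), restricted by a binary foreground mask over the
    event tokens so attention ignores masked-out locations (assumed sketch)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    logits = (q @ k.T) * scale                            # (num_q, num_kv)
    logits = np.where(mask[None, :] > 0.5, logits, -1e9)  # suppress masked tokens
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v                                    # (num_q, dim)
```

With a mask that keeps only one event token, every query collapses onto that token's value vector, which is the intended "attend only inside the mask" behaviour.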
📝 Abstract
Moving object segmentation plays a crucial role in understanding dynamic scenes with multiple moving objects, and its central difficulty lies in jointly accounting for spatial texture structures and temporal motion cues. Existing methods based on video frames struggle to distinguish whether the pixel displacements of an object are caused by camera motion or by object motion, owing to the complexity of accurate image-based motion modeling. Recent advances exploit the motion sensitivity of novel event cameras to compensate for the inadequate motion modeling of conventional images, but the lack of dense texture structure in events makes it difficult to segment accurate pixel-level object masks. To address these two limitations of unimodal settings, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, explicit contrastive feature learning, and flow-guided motion enhancement to exploit dense texture information from a single image and rich motion information from events. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. Extensive evaluations on multiple datasets, together with ablation experiments on different input settings and a real-time efficiency analysis of the proposed framework, validate our approach; we believe this first attempt to integrate image and event data for practical deployment can provide new insights for future event-based motion research. The source code and pre-trained model weights are released at https://npucvr.github.io/EvInsMOS
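The explicit contrastive feature learning mentioned above can be illustrated with a minimal supervised-contrastive objective over per-object motion embeddings: objects sharing a motion label (e.g. moving vs. static) are pulled together, differently labelled ones pushed apart. The function name, temperature value, and numpy-level formulation are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def motion_contrastive_loss(feats, labels, temperature=0.1):
    """Supervised contrastive loss over motion embeddings (assumed sketch):
    for each anchor, same-label embeddings act as positives and all other
    embeddings form the softmax denominator."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalise
    sim = (f @ f.T) / temperature                             # scaled cosine similarity
    n, total, count = len(labels), 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        for j in others:
            if labels[j] == labels[i]:                        # positive pair
                total += log_denom - sim[i, j]
                count += 1
    return total / max(count, 1)
```

Embeddings already clustered by motion label yield a lower loss than embeddings mixed across labels, which is the discriminability the feature learning is meant to encourage.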