DIMOS: Disentangling Instance-level Moving Object Segmentation

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing moving instance segmentation (MIS) methods that fuse event camera data with conventional images struggle with small moving objects. Their performance is also limited by the entanglement of appearance and motion cues within event features, which hinders cross-modal fusion. To address these challenges, this work proposes a dual-disentangling feature extraction framework that explicitly separates appearance and motion information within both the image and event modalities. A multi-granularity cross-modal alignment mechanism then aligns features across modalities so that fusion operates on distributionally and semantically consistent representations. This approach is the first to incorporate both intra-modal disentangling and multi-granularity alignment in multimodal MIS, and it achieves state-of-the-art performance, particularly in challenging scenarios involving small objects, high-speed motion, or low-light conditions.

📝 Abstract

Moving instance segmentation (MIS) has attracted increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and high dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align features across modalities so that they are distributionally and semantically consistent, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.
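
To make the sparsity point concrete, here is a minimal sketch of how an event stream can be accumulated into a frame. It assumes events arrive as `(x, y, timestamp, polarity)` tuples with polarity in {+1, -1}. The names `accumulate_events` and `active_fraction`, the tuple layout, and the 8x8 toy sensor are illustrative assumptions, not the event representation used in the paper.

```python
from typing import Iterable, List, Tuple

# Hypothetical event layout: (x, y, timestamp_us, polarity), polarity in {+1, -1}.
Event = Tuple[int, int, int, int]


def accumulate_events(events: Iterable[Event], width: int, height: int) -> List[List[int]]:
    """Sum event polarities per pixel into a height x width frame."""
    frame = [[0] * width for _ in range(height)]
    for x, y, _t, polarity in events:
        if 0 <= x < width and 0 <= y < height:  # ignore out-of-bounds events
            frame[y][x] += polarity
    return frame


def active_fraction(frame: List[List[int]]) -> float:
    """Fraction of pixels whose net event count is nonzero."""
    total = sum(len(row) for row in frame)
    active = sum(1 for row in frame for value in row if value != 0)
    return active / total if total else 0.0


# Toy stream: a small moving object triggers four events along one row.
events = [(2, 3, 10, 1), (3, 3, 25, 1), (3, 3, 40, 1), (4, 3, 55, -1)]
frame = accumulate_events(events, width=8, height=8)
sparsity = active_fraction(frame)  # only 3 of 64 pixels carry any signal
```

In this toy case a small object touches only three pixels, which illustrates why features derived from events alone can be too sparse to delineate small instances.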

Problem

Research questions and friction points this paper is trying to address.

moving instance segmentation

event cameras

feature disentanglement

multimodal fusion

small object segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled representation

event camera

moving instance segmentation