🤖 AI Summary
This work addresses the instability of object identity in unsupervised video object-centric learning under rapid motion or occlusion. Existing slot-based methods entangle appearance and identity information in a single slot and let weakly matched slots absorb tokens from other objects, which leads to slot swapping. To resolve this, the authors propose Dual-State Slot Attention (DSSA), which explicitly decouples each slot into a local appearance state and a persistent identity state, so that intra-frame reconstruction and inter-frame consistency are optimized on separate representations. In addition, competition-modulated aggregation (CMA) suppresses interference from weakly matched slots and prevents erroneous token assignment. On MOVi-C, MOVi-D, and YouTube-VIS, the method improves segmentation quality and temporal consistency over prior methods and yields stronger downstream object recognition and video dynamics prediction.
📝 Abstract
Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence.
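The second failure mode can be illustrated with a small numerical sketch of the standard Slot Attention weighting: a softmax over slots for each token, followed by renormalization over tokens for each slot. The helper names and the toy logits below are illustrative assumptions, not taken from the paper. The point is that after renormalization every slot's weights sum to 1, so a slot that lost the competition for every token still receives a full-strength update.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def slot_attention_weights(logits):
    """logits[i][k] is the affinity of token i to slot k.

    Returns (attn, mass, weights):
      attn[i][k]    softmax over slots for each token (slots compete per token)
      mass[k]       total attention slot k received, summed over tokens
      weights[i][k] attn renormalized over tokens, so each slot's column sums to 1
    """
    attn = [softmax(row) for row in logits]
    num_slots = len(logits[0])
    mass = [sum(row[k] for row in attn) for k in range(num_slots)]
    weights = [[row[k] / mass[k] for k in range(num_slots)] for row in attn]
    return attn, mass, weights

# Toy case (made-up numbers): slot 0 wins every token, slot 1 loses every token.
logits = [[4.0, 0.0] for _ in range(4)]
attn, mass, weights = slot_attention_weights(logits)

# Despite its tiny attention mass, the weak slot's renormalized weights sum to 1.
weak_column_sum = sum(row[1] for row in weights)
print(f"attention mass per slot: {mass[0]:.3f}, {mass[1]:.3f}")
print(f"renormalized weights of the weak slot sum to {weak_column_sum:.3f}")
```

In this toy case the weak slot holds only a small fraction of the total attention mass, yet its update is a full weighted mean over tokens that belong to the other slot's object.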
We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.
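The abstract does not give the update equations, so the following is only a rough sketch of the two ideas under stated assumptions. The gate in `modulated_update` (received attention mass divided by an expected per-slot share, capped at 1) and the fixed exponential moving average in `identity_step` are hypothetical stand-ins for CMA and for the learned recurrent transition; the function names, parameters, and numbers are all invented for illustration.

```python
def modulated_update(prev_slot, candidate, mass, expected_mass):
    """Blend a slot toward its candidate update, gated by how much attention it won.

    HYPOTHETICAL gate, not the paper's formula: the ratio of received attention
    mass to an expected per-slot share, capped at 1.
    """
    gate = min(1.0, mass / expected_mass)
    return [(1.0 - gate) * p + gate * c for p, c in zip(prev_slot, candidate)]

def identity_step(identity, local, alpha=0.2):
    """Stand-in for the learned recurrent transition.

    ASSUMPTION: a fixed exponential moving average, i.e. a simple low-pass
    temporal filter on the local state. The paper's transition is learned.
    """
    return [(1.0 - alpha) * s + alpha * x for s, x in zip(identity, local)]

# A weakly matching slot barely moves; a strongly matching slot takes the full update.
prev_slot = [1.0, 0.0]
candidate = [0.0, 1.0]
weak = modulated_update(prev_slot, candidate, mass=0.07, expected_mass=2.0)
strong = modulated_update(prev_slot, candidate, mass=3.9, expected_mass=2.0)

# The local state jumps for one frame (e.g. a partial occlusion);
# the filtered identity state changes only gradually.
local_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = list(local_states[0])
trajectory = []
for local in local_states[1:]:
    identity = identity_step(identity, local)
    trajectory.append(identity)

print("weak slot after update:  ", weak)
print("strong slot after update:", strong)
print("identity trajectory:     ", trajectory)
```

Keeping the two states separate means the local state is free to follow the one-frame jump for reconstruction, while the identity state stays close to its earlier value for temporal consistency.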