🤖 AI Summary
Existing trimap-free video matting methods lack deterministic foreground-background cues, leading to poor temporal consistency and blurred edge details. To address this, we propose an Object-Aware Video Matting (OAVM) framework. First, we introduce an Object-Guided Correction and Refinement (OGCR) module, which uses cross-frame guidance to aggregate object-level instance information into pixel-level detail features. Second, we design a Sequential Foreground Merging augmentation strategy to diversify sequential training scenarios and strengthen object discrimination in complex, dynamic scenes. Together, these components form an end-to-end architecture that jointly recognizes foreground objects and refines edge details. Requiring only a coarse mask on the first frame, our method achieves state-of-the-art performance on both synthetic and real-world benchmarks, with notable gains in temporal stability and edge accuracy.
📝 Abstract
Recently, trimap-free methods have drawn increasing attention in human video matting due to their promising performance. Nevertheless, these methods still suffer from the lack of deterministic foreground-background cues, which impairs their ability to consistently identify and localize foreground targets over time and to extract fine-grained details. In this paper, we present a trimap-free Object-Aware Video Matting (OAVM) framework that can perceive different objects, enabling joint recognition of foreground objects and refinement of edge details. Specifically, we propose an Object-Guided Correction and Refinement (OGCR) module, which employs cross-frame guidance to aggregate object-level instance information into pixel-level detail features, thereby promoting their synergy. Furthermore, we design a Sequential Foreground Merging augmentation strategy to diversify sequential scenarios and enhance the network's capacity for object discrimination. Extensive experiments on widely used synthetic and real-world benchmarks demonstrate the state-of-the-art performance of our OAVM given only an initial coarse mask. The code and model will be made available.
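The abstract does not specify how the OGCR module aggregates object-level instance information into pixel-level features. Purely as an illustration of the general idea, the sketch below implements one plausible mechanism: soft attention from each pixel of the current frame to object embeddings taken from a reference frame, followed by residual fusion. The function name, shapes, and the attention/residual design are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def object_guided_aggregation(pixel_feats, obj_feats, tau=1.0):
    """Hypothetical sketch of instance-to-pixel aggregation (not the paper's OGCR).

    pixel_feats: (H*W, C) pixel-level detail features of the current frame
    obj_feats:   (K, C)   object-level instance embeddings from a reference frame
    tau:         softmax temperature (assumed hyperparameter)
    returns:     (H*W, C) pixel features enriched with object-level context
    """
    # Similarity between every pixel and every object embedding.
    logits = pixel_feats @ obj_feats.T / tau            # (H*W, K)
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)             # softmax over objects
    # Each pixel pulls in a convex combination of object features.
    context = attn @ obj_feats                          # (H*W, C)
    return pixel_feats + context                        # residual fusion
```

In a real network the object embeddings would come from an instance-segmentation branch and the attention would be learned; this toy version only shows the data flow from instance-level semantics to pixel-level features across frames.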