Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

📅 2025-11-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses zero-shot video object tracking without fine-tuning, proposing the DRIFT framework. Methodologically, it leverages the inherent cross-frame semantic propagation capability encoded in the self-attention maps of image diffusion models (e.g., Stable Diffusion), using them as pixel-level label propagation kernels. Specifically, inter-frame attention correlations are extracted via DDIM inversion; target specificity is enhanced through textual inversion and adaptive head weighting; and segmentation accuracy is improved via SAM-guided mask refinement. Crucially, this work is the first to systematically uncover and exploit the temporal modeling potential embedded in diffusion models' self-attention mechanisms for fully zero-shot, training-free video semantic propagation. On standard benchmarks including DAVIS, DRIFT achieves state-of-the-art zero-shot performance, significantly improving the robustness and consistency of cross-frame label propagation.
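The core mechanism can be illustrated with a minimal NumPy sketch (the function name, shapes, and temperature `tau` are illustrative assumptions, not the paper's exact formulation): cross-frame attention affinities between current-frame queries and previous-frame keys, softmax-normalized over source pixels, form a row-stochastic kernel that carries a soft label mask from frame t-1 to frame t.

```python
import numpy as np

def propagate_labels(q_t, k_prev, mask_prev, tau=0.1):
    """Propagate a per-pixel soft mask from frame t-1 to frame t.

    Toy shapes (assumed): q_t (N, d) queries of the current frame,
    k_prev (N, d) keys of the previous frame, mask_prev (N,) labels
    in [0, 1]. The cross-frame attention map is used as a
    row-stochastic label propagation kernel.
    """
    affinity = q_t @ k_prev.T / tau                  # (N, N) cross-frame logits
    affinity -= affinity.max(axis=1, keepdims=True)  # numerical stability
    kernel = np.exp(affinity)
    kernel /= kernel.sum(axis=1, keepdims=True)      # softmax over source pixels
    return kernel @ mask_prev                        # propagated soft mask, (N,)
```

With identical features in both frames and a low temperature, the kernel approaches the identity and the mask is carried over unchanged, which is the degenerate sanity check for any propagation kernel of this kind.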

πŸ“ Abstract
Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
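The adaptive head weighting mentioned above can be sketched in toy form (the scoring rule, shapes, and names here are assumptions, not the paper's exact procedure): each self-attention head yields its own propagation kernel, heads are scored by how well their kernel reproduces a known reference mask (e.g. the first-frame annotation), and the scores are softmax-normalized into per-head weights.

```python
import numpy as np

def head_weights(kernels, mask_src, mask_ref):
    """Score each attention head's propagation kernel and normalize.

    kernels: (H, N, N) row-stochastic per-head kernels (ref <- src)
    mask_src: (N,) known mask in the source frame
    mask_ref: (N,) known mask in the reference frame
    Heads whose kernels propagate mask_src closer to mask_ref
    (lower L1 error) receive higher softmax weight.
    """
    preds = kernels @ mask_src                     # (H, N) per-head propagated masks
    errs = np.abs(preds - mask_ref).mean(axis=1)   # per-head L1 error
    scores = -errs
    scores -= scores.max()                         # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def combined_kernel(kernels, w):
    """Blend per-head kernels into one weighted propagation kernel."""
    return np.einsum('h,hij->ij', w, kernels)
```

In this toy setup, a head whose kernel is close to the identity on a static object outscores a head whose kernel smears the mask uniformly, so the combined kernel is dominated by the more target-specific heads.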
Problem

Research questions and friction points this paper is trying to address.

Can image diffusion models enable zero-shot object tracking in videos?
Can self-attention maps be reinterpreted as semantic propagation kernels?
How can diffusion features be adapted for robust temporal label propagation?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinterprets self-attention as semantic propagation kernels
Extends image diffusion to temporal video object tracking
Integrates SAM-guided mask refinement for segmentation accuracy