Text-Driven Video Style Transfer with State-Space Models: Extending StyleMamba for Temporal Coherence

๐Ÿ“… 2025-03-15
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
This paper addresses key challenges in text-driven video style transfer: inter-frame style inconsistency, motion jitter, and poor robustness under scene transitions. To this end, we propose a temporal-aware approach grounded in State Space Models (SSMs). Our method introduces three core innovations: (1) a Video State Space Fusion module that jointly models spatiotemporal features across frames; (2) a Temporal Masked Directional Loss to enforce style consistency in dynamically occluded regions; and (3) a Temporal Second-Order Loss that regularizes the second-order differences of optical flow to enhance motion smoothness. Evaluated on DAVIS and UCF101, our method achieves significant improvements in style consistency (a 12.6% improvement in FID) and visual temporal coherence. Moreover, it outperforms mainstream diffusion- and Transformer-based baselines in computational efficiency, enabling near-real-time video stylization.
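The page does not reproduce the loss formulas, but the idea of penalizing second-order temporal differences can be sketched in a few lines. Below is a minimal NumPy illustration; the function name and tensor layout are assumptions, and for simplicity the penalty is applied directly to stylized frames, whereas the summary states the paper applies it to optical-flow fields.

```python
import numpy as np

def temporal_second_order_loss(frames):
    """Mean squared second-order temporal difference across a sequence.

    A hypothetical sketch of a Temporal Second-Order Loss: for each
    triplet of consecutive frames it penalizes f[t+1] - 2*f[t] + f[t-1],
    which is zero whenever the sequence changes linearly over time.

    frames: array of shape (T, H, W, C) with T >= 3.
    """
    frames = np.asarray(frames, dtype=np.float64)
    second_diff = frames[2:] - 2.0 * frames[1:-1] + frames[:-2]
    return float(np.mean(second_diff ** 2))
```

A constant or linearly drifting stylization incurs zero loss, so smooth style evolution is allowed while abrupt frame-to-frame jumps are penalized quadratically.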

๐Ÿ“ Abstract
StyleMamba has recently demonstrated efficient text-driven image style transfer by leveraging state-space models (SSMs) and masked directional losses. In this paper, we extend the StyleMamba framework to handle video sequences. We propose new temporal modules, including a *Video State-Space Fusion Module* to model inter-frame dependencies and a novel *Temporal Masked Directional Loss* that ensures style consistency while addressing scene changes and partial occlusions. Additionally, we introduce a *Temporal Second-Order Loss* to suppress abrupt style variations across consecutive frames. Our experiments on DAVIS and UCF101 show that the proposed approach outperforms competing methods in terms of style consistency, smoothness, and computational efficiency. We believe our new framework paves the way for real-time text-driven video stylization with state-of-the-art perceptual results.
Problem

Research questions and friction points this paper is trying to address.

Extend StyleMamba for text-driven video style transfer
Ensure temporal coherence in video style transfer
Improve style consistency and smoothness in video sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video State-Space Fusion Module for inter-frame dependencies
Temporal Masked Directional Loss for style consistency
Temporal Second-Order Loss to suppress abrupt variations
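To make the masked directional idea concrete: directional losses of this kind typically align the per-frame shift between stylized and content embeddings with a target text direction, while a mask drops frames affected by scene cuts or occlusion. The sketch below is a plausible NumPy rendering under those assumptions; the function name, embedding shapes, and masking scheme are illustrative, not the paper's exact formulation.

```python
import numpy as np

def temporal_masked_directional_loss(stylized_emb, content_emb, text_dir,
                                     mask, eps=1e-8):
    """Hypothetical masked directional loss over per-frame embeddings.

    stylized_emb, content_emb: (T, D) per-frame embeddings (e.g. from a
        joint vision-language encoder such as CLIP).
    text_dir: (D,) target style direction in the shared embedding space.
    mask: (T,) weights, 0 for frames to ignore (scene cuts, occlusions).
    """
    delta = stylized_emb - content_emb                        # per-frame image shift
    delta = delta / (np.linalg.norm(delta, axis=1, keepdims=True) + eps)
    t = text_dir / (np.linalg.norm(text_dir) + eps)
    per_frame = 1.0 - delta @ t                               # 1 - cosine similarity
    mask = np.asarray(mask, dtype=np.float64)
    return float((per_frame * mask).sum() / (mask.sum() + eps))
```

Masked-out frames contribute neither to the numerator nor the normalizer, so transient scene changes do not drag the style direction off target.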