Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of achieving semantic controllability and identity consistency at the same time in text-to-video generation. The authors propose ST-DRC (Spatial-Temporal Decoupled Reference Conditioning), a framework that combines text semantics and reference identity information within a diffusion model through latent in-context feature injection: the reference image is encoded with the video VAE and concatenated with the noisy video latents. Key contributions are TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme for spatio-temporally decoupled reference guidance; appearance-invariant reference augmentation combined with face-guided identity objectives; and a three-stream classifier-free guidance strategy that controls text adherence and reference fidelity independently. Built on LTX-2.3, the method reports strong identity preservation, text alignment, temporal consistency, and video quality, and ranks among the top submissions in the facial identity-preserving video generation track.
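The summary does not give the formula behind the three-stream classifier-free guidance, so the following is only a minimal sketch of one plausible form. It assumes the three streams are an unconditional prediction, a reference-only prediction, and a text-plus-reference prediction, combined with two independent scales in the style commonly used for two-condition guidance. The function name, argument names, and the ordering of the conditions are all hypothetical and may differ from what ST-DRC actually does.

```python
from typing import List, Sequence


def three_stream_guidance(
    eps_uncond: Sequence[float],  # assumed stream 1: no text, no reference
    eps_ref: Sequence[float],     # assumed stream 2: reference only
    eps_full: Sequence[float],    # assumed stream 3: text + reference
    ref_scale: float,             # hypothetical knob for reference fidelity
    text_scale: float,            # hypothetical knob for text adherence
) -> List[float]:
    """Combine three noise predictions with two independent guidance scales.

    Assumed form (not confirmed by the source):
        eps = eps_uncond
              + ref_scale  * (eps_ref  - eps_uncond)
              + text_scale * (eps_full - eps_ref)
    """
    if not (len(eps_uncond) == len(eps_ref) == len(eps_full)):
        raise ValueError("all three predictions must have the same length")
    return [
        u + ref_scale * (r - u) + text_scale * (f - r)
        for u, r, f in zip(eps_uncond, eps_ref, eps_full)
    ]


if __name__ == "__main__":
    # Toy flattened predictions; real ones would be latent tensors.
    guided = three_stream_guidance(
        [0.0, 2.0], [1.0, 2.0], [3.0, 4.0], ref_scale=1.0, text_scale=1.0
    )
    print(guided)
```

Under this assumed form, setting both scales to 1 recovers the fully conditioned prediction and setting both to 0 recovers the unconditional one, so each scale can be raised or lowered without touching the other.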

📝 Abstract

Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.
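To make the TASS-RoPE idea more concrete, here is a rough sketch of how position indices might be assigned when reference tokens are concatenated with video tokens. It only illustrates the stated principle (reference tokens close to the video in time, displaced in space). The specific choices are assumptions that do not come from the abstract: the reference sits one step before the first frame, it is shifted by one full frame width, the reference tokens come first in the sequence, and the reference latent has the same spatial size as a video frame. All function and parameter names are hypothetical.

```python
from typing import List, Optional, Tuple

# (time, height, width) index that a 3D RoPE would rotate by.
Position = Tuple[int, int, int]


def video_positions(frames: int, height: int, width: int) -> List[Position]:
    """Standard grid positions for the noisy video latent tokens."""
    return [
        (t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def reference_positions(
    height: int,
    width: int,
    ref_time: int = -1,             # assumption: adjacent to frame 0 in time
    shift_h: int = 0,               # assumption: no vertical shift
    shift_w: Optional[int] = None,  # assumption: defaults to one frame width
) -> List[Position]:
    """Positions for reference tokens: near in time, shifted in space."""
    if shift_w is None:
        shift_w = width
    return [
        (ref_time, h + shift_h, w + shift_w)
        for h in range(height)
        for w in range(width)
    ]


def build_sequence_positions(frames: int, height: int, width: int) -> List[Position]:
    """Positions for the concatenated [reference tokens, video tokens] sequence."""
    return reference_positions(height, width) + video_positions(frames, height, width)


if __name__ == "__main__":
    for pos in build_sequence_positions(frames=2, height=2, width=2):
        print(pos)
```

With these assumed offsets, no reference token shares a spatial coordinate with any video token, which is one way to discourage attention from copying pixels at the same location, while the temporal gap of a single step keeps the reference within easy reach of spatio-temporal attention.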

Problem

Research questions and friction points this paper is trying to address.

Identity-Preserving Video Generation

Text-to-Video Generation

Spatial-Temporal Conditioning

Reference Identity Preservation

Diffusion Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Temporal Decoupling

Reference Conditioning

Identity-Preserving Video Generation