SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that current text-to-video generation models often fail to accurately represent dynamic spatial relationships described in input text, leading to spatially inconsistent or logically implausible outputs. To mitigate this issue, we propose SPATIALALIGN, a framework that fine-tunes models using Direct Preference Optimization (DPO) with zeroth-order regularization to enhance their understanding and generation of dynamic spatial relations. We introduce DSR-SCORE, a geometry-based metric to quantitatively assess alignment between dynamic spatial relationships in generated videos and reference text, and present the first text-video dataset specifically curated to capture diverse dynamic spatial configurations. Experimental results demonstrate that models fine-tuned with SPATIALALIGN significantly outperform existing baselines in aligning dynamic spatial relationships, thereby improving the spatial logical fidelity of generated videos.

📝 Abstract
Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the DSRs specified in prompts, which is a step forward from prior works that rely on VLMs for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released at Link.
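For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not the zeroth-order regularized variant the abstract describes, whose details are not given here. The function name `dpo_loss`, its parameters, and the default `beta` are illustrative assumptions, and the log-likelihood inputs are stand-ins for whatever per-sample scores a T2V model would actually provide.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All names here are hypothetical. Inputs are log-likelihoods of the
    preferred ("win") and rejected ("lose") samples under the model being
    fine-tuned (policy) and under a frozen reference model.
    """
    # How much more the policy favors each sample than the reference does.
    win_margin = policy_logp_win - ref_logp_win
    lose_margin = policy_logp_lose - ref_logp_lose
    z = beta * (win_margin - lose_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# The loss falls below log(2) once the policy favors the preferred sample
# more strongly than the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < math.log(2))
```

In a setting like the one the abstract describes, the preferred and rejected videos for a prompt would presumably be chosen by a score such as DSR-SCORE, but how the pairs are formed is an assumption here rather than something stated in the text.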
Problem

Research questions and friction points this paper is trying to address.

text-to-video generation
spatial relationships
dynamic spatial relationships
video alignment
spatial constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Spatial Relationships
Direct Preference Optimization
Geometry-based Evaluation
Text-to-Video Generation
Zeroth-order Regularization
Authors

Fengming Liu
College of Computing and Data Science, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
Tat-Jen Cham
Nanyang Technological University
Computer Vision
Chuanxia Zheng
PVG, Nanyang Technological University
computer vision, machine learning, Physical AI, Spatial AI, Generative AI