SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

📅 2026-02-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge that current text-to-video generation models often fail to accurately represent dynamic spatial relationships described in input text, leading to spatially inconsistent or logically implausible outputs. To mitigate this issue, we propose SPATIALALIGN, a framework that fine-tunes models using Direct Preference Optimization (DPO) with zeroth-order regularization to enhance their understanding and generation of dynamic spatial relations. We introduce DSR-SCORE, a geometry-based metric to quantitatively assess alignment between dynamic spatial relationships in generated videos and reference text, and present the first text-video dataset specifically curated to capture diverse dynamic spatial configurations. Experimental results demonstrate that models fine-tuned with SPATIALALIGN significantly outperform existing baselines in aligning dynamic spatial relationships, thereby improving the spatial logical fidelity of generated videos.
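The summary describes DSR-SCORE only as geometry-based; its actual definition is not given here. As a purely illustrative sketch of what a geometric check on a dynamic relation could look like, the Python below assumes per-frame bounding boxes `(x1, y1, x2, y2)` for two tracked objects and scores the relation "A starts left of B and ends right of B". The function names, the box format, and the scoring rule are all hypothetical and are not the paper's metric.

```python
def center_x(box):
    """Horizontal center of an (x1, y1, x2, y2) bounding box."""
    x1, _, x2, _ = box
    return (x1 + x2) / 2.0


def left_to_right_score(boxes_a, boxes_b):
    """Toy score in [0, 1] for 'A starts left of B and ends right of B'.

    Hypothetical illustration only: compares box centers in the first
    and last halves of the clip. Not the paper's DSR-SCORE.
    """
    if not boxes_a or len(boxes_a) != len(boxes_b):
        raise ValueError("need two non-empty box tracks of equal length")

    # Signed horizontal offset of A relative to B in every frame.
    offsets = [center_x(a) - center_x(b) for a, b in zip(boxes_a, boxes_b)]

    half = max(1, len(offsets) // 2)
    start, end = offsets[:half], offsets[-half:]

    # Fraction of early frames with A left of B, and late frames with A right of B.
    start_ok = sum(o < 0 for o in start) / len(start)
    end_ok = sum(o > 0 for o in end) / len(end)
    return start_ok * end_ok


# Object B stays put; object A crosses from its left to its right.
track_b = [(10, 0, 20, 10)] * 4
track_a = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 0, 30, 10), (30, 0, 40, 10)]
print(left_to_right_score(track_a, track_b))
```

A score like this depends only on object geometry, so it needs a detector or tracker to supply the boxes but no vision-language model to judge the video.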

📝 Abstract

Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the DSRs specified in prompts, a step forward from prior works that rely on VLMs for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released in Link.
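For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair (a preferred sample `w` and a rejected sample `l`) from log-probabilities under the trained model and a frozen reference model. It shows only the generic objective: how SPATIALALIGN obtains these quantities for a video generator, and what the zeroth-order regularizer adds, are not stated in the abstract and are not modeled here. The function and parameter names are illustrative assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : log-probabilities under the model being fine-tuned
    ref_logp_* : log-probabilities under the frozen reference model
    beta       : strength of the implicit KL constraint (assumed value)
    """
    # Implicit reward margin between the preferred and rejected samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# With identical model and reference, the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Raising the preferred sample's likelihood relative to the reference lowers the loss.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

In a self-improvement setup such as the one the abstract describes, a metric like DSR-SCORE could plausibly rank generated videos to form the preferred and rejected pairs; that pairing procedure is an assumption here, not a detail given in the source.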

Problem

Research questions and friction points this paper is trying to address.

text-to-video generation

spatial relationships

dynamic spatial relationships

video alignment

spatial constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Spatial Relationships

Direct Preference Optimization

Geometry-based Evaluation

Text-to-Video Generation

Zeroth-order Regularization
