PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

📅 2024-11-30
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing text-to-video (T2V) diffusion models suffer from significant deficiencies in physical commonsense adherence and temporal dynamics modeling, primarily due to limited physical understanding and inadequate sequential modeling. Current approaches either rely on large-scale annotated datasets or require dedicated physics modules, resulting in poor generalizability. This paper proposes a **data-agnostic, model-agnostic, LLM-guided prompt enhancement framework** that enables zero-shot physically plausible video generation. Our method integrates chain-of-thought (CoT) and step-back reasoning, physics-rule-constrained prompt engineering, and native diffusion-model prompt optimization. Crucially, it requires no additional training data or architectural modifications and supports out-of-distribution generalization to novel physical scenarios. Experiments demonstrate a 2.3× improvement in physical consistency over baseline T2V models and a 35% gain over prior prompt-based methods. The code is publicly available.
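The refinement loop described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: `llm()` and `t2v_model()` are stand-in stubs for a reasoning LLM and a T2V diffusion model, and the three-step structure (step-back rule extraction, mismatch analysis, CoT prompt rewriting) is an assumed reading of the summary.

```python
# Hypothetical sketch of PhyT2V-style iterative prompt refinement.
# llm() and t2v_model() are stand-ins for real model APIs.

def llm(instruction: str) -> str:
    """Stand-in for a reasoning LLM call."""
    return f"[LLM response to: {instruction[:40]}...]"

def t2v_model(prompt: str) -> str:
    """Stand-in for a T2V diffusion model; returns a video handle."""
    return f"video({prompt[:30]}...)"

def phyt2v_refine(user_prompt: str, rounds: int = 3) -> str:
    """Iteratively refine a T2V prompt without retraining the model."""
    prompt = user_prompt
    for _ in range(rounds):
        # Step-back reasoning: extract the physical rules the scene obeys.
        rules = llm(f"List the physical rules governing: {prompt}")
        # Generate, caption, and compare to find semantic mismatches.
        video = t2v_model(prompt)
        caption = llm(f"Describe this video: {video}")
        mismatch = llm(f"Mismatch between '{prompt}' and '{caption}'?")
        # Chain-of-thought rewrite, constrained by rules and mismatch.
        prompt = llm(
            f"Rewrite '{prompt}' so the video obeys {rules} "
            f"and resolves {mismatch}"
        )
    return prompt
```

Because the loop operates purely on prompts, it is data-agnostic and model-agnostic: any T2V backbone can be dropped in for `t2v_model()` with no architectural change.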

📝 Abstract
Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficient temporal modeling. Existing solutions are either data-driven or require extra model inputs, and cannot generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands a current T2V model's video generation capability to out-of-distribution domains by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement over T2V prompt enhancers. The source code is available at: https://github.com/pittisl/PhyT2V.
Problem

Research questions and friction points this paper is trying to address.

Enhancing text-to-video models' adherence to physical rules
Generalizing video generation to out-of-distribution domains
Improving realism without requiring extra model inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided iterative self-refinement for T2V
Chain-of-thought reasoning in video prompting
Data-independent physics-grounded video generation
Qiyao Xue
University of Pittsburgh
Xiangyu Yin
University of Pittsburgh
Boyuan Yang
University of Pittsburgh
Wei Gao
University of Pittsburgh