🤖 AI Summary
Existing text-to-video (T2V) diffusion models show significant deficiencies in adhering to physical commonsense and in modeling temporal dynamics, stemming from limited physical understanding and weak sequential modeling. Current remedies either rely on large-scale annotated datasets or require dedicated physics modules, and consequently generalize poorly. This paper proposes PhyT2V, a **data-agnostic, model-agnostic, LLM-guided prompt enhancement framework** that enables zero-shot, physically plausible video generation. The method combines chain-of-thought (CoT) and step-back reasoning, physics-rule-constrained prompt engineering, and the diffusion model's native prompt optimization. Crucially, it requires no additional training data or architectural modifications and generalizes out-of-distribution to novel physical scenarios. Experiments demonstrate a 2.3× improvement in adherence to physical rules over baseline T2V models and a 35% gain over prior prompt-based enhancers. The code is publicly available.
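The refinement loop described above (step-back reasoning to extract physical rules, then CoT-style prompt rewriting, iterated until convergence) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the functions `analyze_physics` and `refine_prompt` are hypothetical stand-ins that a real pipeline would replace with LLM queries, feeding the final prompt to a T2V model.

```python
def analyze_physics(prompt: str) -> list[str]:
    """Step-back reasoning (mocked): infer physical rules the scene implies.
    A real system would ask an LLM to enumerate the governing physics."""
    rules = []
    if "ball" in prompt:
        rules.append("objects fall under gravity and bounce with energy loss")
    if "water" in prompt:
        rules.append("liquids conserve volume and flow downhill")
    return rules

def refine_prompt(prompt: str, rules: list[str]) -> str:
    """CoT-style refinement (mocked): fold only the rules not yet
    reflected in the prompt back into its text."""
    new = [r for r in rules if r not in prompt]
    if not new:
        return prompt
    return prompt + " Physical constraints: " + "; ".join(new) + "."

def enhance(prompt: str, max_rounds: int = 3) -> str:
    """Iterate reason-then-refine until the prompt stops changing.
    No training data or model architecture changes are involved."""
    for _ in range(max_rounds):
        refined = refine_prompt(prompt, analyze_physics(prompt))
        if refined == prompt:  # converged: no new constraints found
            break
        prompt = refined
    return prompt

enhanced = enhance("A ball dropping onto a table")
print(enhanced)
```

The enhanced prompt is what gets passed to the unmodified T2V model, which is why the approach is both data- and model-agnostic.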
📝 Abstract
Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficient temporal modeling. Existing solutions are either data-driven or require extra model inputs, and cannot generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands current T2V models' video generation capability to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement over T2V prompt enhancers. The source code is available at: https://github.com/pittisl/PhyT2V.